Basketball-SORT: An Association Method for Complex Multi-object Occlusion Problems in Basketball Multi-object Tracking

Qingrui Hu, Atom Scott, Calvin Yeung, Keisuke Fujii [1,2,3]

[1] Graduate School of Informatics, Nagoya University, Chikusa-ku, Nagoya, Aichi, Japan

[2] RIKEN Center for Advanced Intelligence Project, 1-5, Yamadaoka, Suita, Osaka, Japan

[3] PRESTO, Japan Science and Technology Agency, Kawaguchi, Saitama, Japan

Abstract

Recent deep learning-based object detection approaches have led to significant progress in multi-object tracking (MOT) algorithms. Current MOT methods mainly focus on pedestrian or vehicle scenes, but basketball scenes frequently involve occlusions among three or more objects with similar appearances and high-intensity complex motions, which we call complex multi-object occlusion (CMOO). Here, we propose an online and robust MOT approach, named Basketball-SORT, which focuses on the CMOO problems in basketball videos. To overcome the CMOO problem, instead of using the intersection-over-union-based (IoU-based) approach, we use the trajectories of neighboring frames based on the projected positions of the players. Our method introduces the basketball game restriction (BGR) and reacquiring Long-Lost IDs (RLLI) based on the characteristics of basketball scenes, and we also solve the occlusion problem based on the player trajectories and appearance features. Experimental results show that our method achieves a Higher Order Tracking Accuracy (HOTA) score of 63.48% on the basketball fixed video dataset and outperforms other recent popular approaches. Overall, our approach solves the CMOO problem more effectively than recent MOT algorithms.

keywords:
person re-identification, sports, computer vision, video processing

1 Introduction

Multi-object tracking (MOT) is a fundamental computer vision task that aims to detect, localize, and track multiple objects simultaneously across the frames of a video sequence. It has become increasingly important in fields such as autonomous driving [1], surveillance [2, 3], and sports analysis [4]. Most recent tracking algorithms focus mainly on crowded street scenes [2, 3], static dancing [5], driving scenarios [1], and sports analysis [4].

Currently, the most popular approach for MOT is tracking-by-detection. The process of the tracking-by-detection MOT method [6, 7, 8] usually consists of the following steps: (1) Object detection: First, an object detection algorithm is used to detect the targets in each frame. (2) Object association: The targets detected in consecutive frames are associated. The association is achieved by matching the targets detected in the current frame with the previously tracked targets, for example by using the Kalman filter (KF), which is widely used for this task. (3) Object tracking: The tracking process begins once the targets have been associated between frames. This involves estimating the trajectory and state of each target over time. The current mainstream methods [1, 9, 10] have achieved good results on several open-source datasets, which primarily cover pedestrian and driving scenarios. However, these methods do not perform well on challenging datasets, especially on datasets designed for sports scenarios [11], since the tracking problem in sports scenarios is more complex. The demand for sports performance analysis is growing, so the field of sports multi-target tracking needs more attention.

Unlike MOT for pedestrians or vehicles, team sports face more challenges due to a variety of reasons, including three or more object occlusions with similar appearance and high-intensity complex and unpredictable motions as illustrated in Figure 1, which we call Complex Multi-object Occlusion (CMOO) problems that frequently appear in basketball. Previous methods used appearance-motion fusion [12, 13] or simply motion-based methods [10, 14] to solve the association problems, but they did not focus on CMOO problems.

Figure 1: An example of Complex Multi-object Occlusion (CMOO) problems with a sequence of (a) to (c), including three or more player occlusions with similar appearance of players’ jerseys in the same team and high-intensity complex and unpredictable motion. This can greatly affect detection and tracking accuracy and will cause a player to be lost and an ID switch.

In this paper, we propose a robust online MOT algorithm specifically designed for solving the CMOO problems, called Basketball-SORT, which is inspired by the SORT (Simple Online and Realtime Tracking) algorithm [13]. Among various team sport MOT datasets [15, 16, 11, 17], we use the basketball fixed camera dataset from the TeamTrack dataset [17] to validate our approach because it includes frequent CMOO problems. Our experimental results show that our algorithm effectively solves the CMOO problems and outperforms all other tracking algorithms on the basketball fixed camera dataset.

Our paper has three main contributions as follows. (1) We propose an online and robust MOT approach, named Basketball-SORT, which solves the CMOO problems, including occlusions among three or more objects with similar appearance and high-intensity, unpredictable motions. (2) We address the frequent occlusion problems in basketball scenes by utilizing the Re-Identification (ReID) features, positions, and velocities of players before and after occlusions. By incorporating BGR and RLLI, we resolve the issue of players losing their IDs after being occluded for an extended period. (3) Experimental results show that our method achieves a Higher Order Tracking Accuracy (HOTA) score of 63.48% on the basketball fixed video dataset and outperforms other recent popular approaches. Overall, our approach solves the CMOO problem more effectively than recent MOT algorithms.

2 Related work

2.1 Motion-based Multi-Object Tracking

Despite the significant advancements in object detection algorithms, many contemporary end-to-end MOT models still fall short in performance compared to traditional motion model-based tracking techniques. The Kalman filter [18] serves as the cornerstone for the most famous family of tracking-by-detection approaches. SORT [6] used a linear motion model and associated motion trajectories by IoU (intersection-over-union). ByteTrack [10] used low-score detection boxes to recover missed pedestrians, achieving good performance by balancing the detection quality and tracking confidence. More recently, OC-SORT [14] improved association accuracy for nonlinear motion scenes using an observation-centered approach. Some methods utilize the bounding box distance as the cost of the association, while some recent works utilize different IoU computation methods, such as calculating the BIoU (the buffered IoU of two overlapping boxes) [19]. Another work iteratively expands the IoU according to different scales of expansion (EIoU) [20] for the bounding box association between frames, and demonstrates its effectiveness for MOT on the SportsMOT [11] and SoccerNet-Tracking [15] datasets.
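As a point of reference for the IoU-based association cost discussed above, the following is a minimal sketch of the standard IoU between two axis-aligned boxes. The `(x1, y1, x2, y2)` box layout and the function name `iou` are assumptions made for illustration; the buffered and expanded variants [19, 20] are not reproduced here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

A tracker of this family would typically use `1 - iou(...)` as the matching cost between a predicted box and a detection.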

2.2 Appearance-based Multi-Object Tracking

Visual identification serves as an intuitive cue for associating targets over time. One of the pioneering methods to incorporate deep visual features for object association is DeepSORT [13]. Several approaches [13, 8] employ ReID models to extract embedding features from detections, which are then used for association. In recent years, the emergence of transformers [21] has sparked a new trend in utilizing appearance for MOT, where object association is formulated as a query-matching task [22, 23, 24]. However, it has been observed that appearance-based methods tend to be less effective when the objects of interest have similar visual characteristics [5] or are subject to occlusion [2], which is often the case in basketball scenarios.

2.3 Multi-Object Tracking in Sports

Multiple Object Tracking (MOT) in sports environments presents significantly greater challenges compared to other domains where MOT is applied. This can be attributed to the unique characteristics of sports, such as the rapid and unpredictable movements of athletes, the visual similarity among players within the same team, and the increased occurrence of occlusions due to the dynamic nature of the sport. Several researchers have made notable contributions to address these challenges in various sports. For instance, in hockey, Vats et al. [25] propose a method that integrates team classification and player identification techniques to enhance tracking performance. In football, Maglo et al. [26] showcase improved tracking accuracy by localizing the field and players. Moreover, Sangüesa et al. [27] leverage human pose information and actions as embedding features to boost the tracking performance of basketball players. Huang et al. [28] tackle multiple sports scenarios, including basketball, volleyball, and football, by combining OC-SORT with appearance-based post-processing. Our method primarily focuses on the CMOO problem in basketball. By incorporating CMOO’s characteristics, we aim to perform robust and accurate player tracking.

3 Methods

Our approach adopts the tracking-by-detection paradigm, which also enables online tracking without using future information. Our method has the following three steps, as shown in Figure 2. First, in Section 3.1, we use a fine-tuned YOLOv8 model to detect all the players. We then project the players’ image positions onto the 2D court plane and associate the trajectories in each frame based on the players’ 2D positions. Second, in Section 3.2, we set the basketball game restriction (BGR) according to the rules of the basketball game and reacquire the tracking ID of a player in the Long-Lost state (RLLI) according to the trajectory characteristics of the player. Third, in Section 3.3, because of BGR and RLLI, the ID-increasing problem is converted to the ID-switch problem. Player occlusion causes ID switches, and we use the appearance features and motion features of the trajectories to solve the ID-switch problem.

Figure 2: The pipeline of our Basketball-SORT algorithm. The main technical contributions are BGR and RLLI. BGR: In a basketball game, there can only be 10 player motion trajectories. At the 100th frame, we calculate the 10 longest trajectories on the court and identify them as players. RLLI: For a player in the Long-Lost state, we compare the position and appearance information of the Long-Lost player with the new detection bounding box to determine whether the player tracking ID should be reacquired.

3.1 Trajectory tracking based on projected position

Among various team sport MOT problems [15, 16, 11, 17], we consider the basketball fixed camera setting from the TeamTrack dataset [17] to validate our approach because it includes frequent CMOO problems. Since the camera is fixed, we locate in the video the image coordinates of the 20 court key points defined in the standard overhead view of the court, illustrated by the red dots in Figure 3, and then calculate the homography matrix by the following formula:

$$
\left[\begin{array}{l}x_{c}\\ y_{c}\\ w_{c}\end{array}\right]=H\left[\begin{array}{l}x_{i}\\ y_{i}\\ 1\end{array}\right] \tag{1}
$$

where $(x_{c}, y_{c})$ denotes the coordinates in the basketball court coordinate system, $(x_{i}, y_{i})$ denotes the coordinates in the image coordinate system, and $w_{c}$ denotes the scale factor of the homogeneous coordinates, which allows the projective transformation to be written as a single matrix product. The homography matrix $H$ can be computed from the 20 standardized court key points whose image coordinates we manually input (Figure 3).
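The mapping in Eq. (1) can be sketched as follows, assuming the $3\times 3$ matrix $H$ has already been estimated from the key points. The function name `project_to_court` and the nested-list layout of `H` are illustrative choices, and the final division by $w_{c}$ is the usual homogeneous normalization rather than a step spelled out in the text.

```python
def project_to_court(H, x_i, y_i):
    """Apply a 3x3 homography H (nested lists) to an image point (x_i, y_i)."""
    # Homogeneous product of Eq. (1): [x_c, y_c, w_c]^T = H [x_i, y_i, 1]^T
    x_c = H[0][0] * x_i + H[0][1] * y_i + H[0][2]
    y_c = H[1][0] * x_i + H[1][1] * y_i + H[1][2]
    w_c = H[2][0] * x_i + H[2][1] * y_i + H[2][2]
    if abs(w_c) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Dividing by the scale factor gives the court-plane position.
    return x_c / w_c, y_c / w_c
```

For example, with `H = [[2, 0, 1], [0, 2, 3], [0, 0, 2]]` the image point `(1, 1)` maps to the court point `(1.5, 2.5)`.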

Figure 3: Standard basketball court projection. We use the red dots in the figure to convert coordinates in the image coordinate system to the player’s position in the basketball court coordinate system.

We use the bottom midpoint of the detection bounding box as the player’s position in the image coordinate system and convert it into a top-view projected position. As a result, we obtain the relative coordinates of the players in the court plane, which makes it easier to track each player’s position on the court. Players are associated across frames using the Kalman filter, based on the position and speed of the projected position moving on the plane. We did not utilize IoU association because of the frequent occurrence of CMOO, where players are far apart in actual distance but have almost completely overlapping bounding boxes. We did not employ ReID for this association because the players’ uniforms make their appearances too similar.
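The Kalman-filter step on the court plane can be illustrated with a deliberately simplified sketch. It assumes a constant-velocity model with a time step of one frame and filters each court axis independently; the class name `AxisKalman` and the noise values `q` and `r` are hypothetical, since the text does not specify the state model or its parameters.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one court axis (state: position, velocity)."""

    def __init__(self, pos, q=0.01, r=1.0):
        self.p, self.v = float(pos), 0.0
        # Covariance: fairly sure about the position, unsure about the velocity.
        self.P = [[1.0, 0.0], [0.0, 100.0]]
        self.q, self.r = q, r  # assumed process / measurement noise

    def predict(self):
        """Advance one frame: p <- p + v, P <- F P F^T + Q with F = [[1, 1], [0, 1]]."""
        (a, b), (c, d) = self.P
        self.p += self.v
        self.P = [[a + b + c + d + self.q, b + d], [c + d, d + self.q]]
        return self.p

    def update(self, z):
        """Correct the state with a measured court position z."""
        (a, b), (c, d) = self.P
        s = a + self.r                # innovation variance
        k0, k1 = a / s, c / s         # Kalman gain for position and velocity
        y = z - self.p                # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]
```

One filter per axis would be run for each player; the predicted court positions are then matched to the projected detections, and while a player is occluded only `predict()` is called.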

3.2 Basketball Game Restriction and Reacquiring Long-Lost IDs

In a basketball game, players are often severely occluded during offense and defense, which frequently leads to a lost player being assigned a new tracking ID after reappearing. In addition, due to the detector’s limited performance and the confusion between referees and players, some referees and members of the audience may be detected as players, producing redundant trajectories. To solve these problems, we added the basketball game restriction (BGR): because a basketball game has 10 players on the court, there will only be 10 tracking IDs in a game, which solves the problem of the number of IDs increasing. We calculate the length of all trajectories at a designated frame and select the ten longest trajectories on the court to be recognized as players. In this study, the designated frame is set to the 100th frame because referee and audience trajectories are shorter, and all 10 player trajectories are available at the 100th frame. These ten trajectories are then fixed, and no new trajectories are generated.
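The BGR selection can be sketched as below. Several details are assumptions made for illustration only: trajectory "length" is taken to be the number of observations up to the designated frame, "on the court" is taken to mean inside a rectangle in court coordinates (28 m by 15 m expressed in cm), and the data layout and the function name `select_player_tracks` are hypothetical.

```python
def select_player_tracks(tracks, frame_idx=100, num_players=10,
                         court=(0.0, 0.0, 2800.0, 1500.0)):
    """Return the IDs of the `num_players` longest on-court trajectories.

    tracks: dict mapping track_id -> list of (frame, x, y) court observations.
    court:  (x_min, y_min, x_max, y_max) in court coordinates (assumed, in cm).
    """
    x_min, y_min, x_max, y_max = court
    lengths = {}
    for track_id, observations in tracks.items():
        # Count only observations up to the designated frame that lie on the court.
        lengths[track_id] = sum(
            1 for frame, x, y in observations
            if frame <= frame_idx and x_min <= x <= x_max and y_min <= y <= y_max
        )
    # Longest first; ties are broken by track ID to keep the result deterministic.
    ranked = sorted(lengths, key=lambda tid: (-lengths[tid], tid))
    return ranked[:num_players]
```

Short referee trajectories and trajectories that stay outside the court rectangle drop out of the ranking, so only the ten retained IDs are carried forward.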

Once the 10 players’ trajectories are determined, if a trajectory is occluded for less than $B$ frames, we define it as being in the Lost state. If a trajectory is lost for more than $B$ frames, we define it as being in the Long-Lost state. Because players usually play 1-on-1, trajectories frequently enter the Lost or Long-Lost state. For a player in the Lost state, we still use the KF to predict the position in each frame and match it with the detection boxes. For players in the Long-Lost state, we need to reacquire their tracking ID (RLLI). We record the player’s court position and ReID features before the player is lost. When a new detection reappears on the court, we reassign the player’s ID by comparing its appearance similarity and distance to the state recorded before the player disappeared. The appearance similarity is a vital clue for object association between frames, which can filter out some impossible associations. It can be calculated as the cosine similarity between the appearance features, and we compute the appearance feature of the detection box in each frame. The cost for appearance association, $Cost$, can be directly obtained from the cosine similarity with the following formula:

$$
\text{Cost}_{a\_b}=1-\text{Cosine Similarity}=1-\frac{a\cdot b}{\|a\|\|b\|} \tag{2}
$$

where $a$ and $b$ denote the appearance features of the detection boxes in two specific frames, respectively. A higher cosine similarity indicates a higher appearance similarity, while a lower cosine similarity indicates that the appearance of the trajectory is different from the detection appearance. For a reappeared detection to be rematched to the lost trajectory, it needs to meet the following conditions:

$$
\begin{cases}{Cost}_{lost\_re}>\alpha\\ {Dist}_{lost\_re}>\beta\end{cases} \tag{3}
$$

where ${Cost}_{lost\_re}$ denotes the cosine similarity between the lost trajectory’s appearance feature before disappearing and the appearance feature of the reappeared detection; ${Dist}_{lost\_re}$ represents their Euclidean distance; $\alpha$ and $\beta$ denote the thresholds set for similarity and distance, respectively.
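The quantities used by RLLI can be sketched as follows: the Lost / Long-Lost state as a function of the number of missing frames, the appearance cost of Eq. (2), and the Euclidean distance on the court plane. The function names, the "Tracked" label, and the choice to treat exactly $B$ missing frames as Long-Lost are assumptions; the thresholds $\alpha$ and $\beta$ would then be applied to these values as in Eq. (3).

```python
import math


def track_state(frames_missing, B):
    """Classify a trajectory by how many consecutive frames it has been missing."""
    if frames_missing == 0:
        return "Tracked"  # hypothetical label for a currently matched trajectory
    # Assumption: exactly B missing frames is treated as Long-Lost.
    return "Lost" if frames_missing < B else "Long-Lost"


def cosine_cost(a, b):
    """Appearance cost of Eq. (2): one minus the cosine similarity of two features."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("appearance features must be non-zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def court_distance(p, q):
    """Euclidean distance between two court positions (x, y)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])
```

Here `cosine_cost` and `court_distance` would be evaluated between the feature and position stored when the player was lost and those of the reappeared detection.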

3.3 Solving Same and Different Team Player Occlusion Problem

In the previous SOTA tracking methods [20, 4] on the SportsMOT [11] dataset, a trajectory that is lost for more than a specific number of frames is discarded, and when a new detection box that matches no trajectory reappears, it is initialized as a new trajectory. We call this the ID-increasing problem. Since we added BGR, the ID-increasing problem becomes the ID switch problem. In addition, the long occlusions caused by defense, as mentioned in the previous section, also lead to ID switches.

For the ID switch problem, we first need to detect whether each player is occluded in each frame so that we can record the occlusion state of the player. If the player’s trajectory is lost, it means the player has been occluded. When the player’s trajectory reappears, we record the two players closest to the Lost-state player in the occluded frame, indicating that the occluded player may have had an ID switch with these two players. In addition, we record the projected court position history and the ReID features for each frame. Then, we categorize the occlusion as Same Team Occlusion (STO) or Different Team Occlusion (DTO) according to whether the two occluded players are on the same team or not. We can determine which situation the occlusion belongs to based on the ReID similarity of the two players before and after the occlusion. The formula is as follows:

$$
Occlusion=\begin{cases}STO, & \text{if } \left|R_{a\_N}^{before}-R_{b\_N}^{before}\right|<\gamma \text{ or } \left|R_{a\_M}^{after}-R_{b\_M}^{after}\right|<\gamma\\ DTO, & \text{if } \left|R_{a\_N}^{before}-R_{b\_N}^{before}\right|>\gamma \text{ or } \left|R_{a\_M}^{after}-R_{b\_M}^{after}\right|>\gamma\end{cases} \tag{4}
$$

where $R_{a\_N}^{before}$ denotes the ReID features of player $a$ for $N$ frames before occlusion occurs, and $R_{a\_M}^{after}$ denotes its ReID features for $M$ frames after occlusion occurs. The same applies to player $b$. $\gamma$ denotes the similarity threshold between two players before and after occlusion, which is used to distinguish whether the occlusion is STO or DTO.
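One possible reading of Eq. (4) is sketched below. It assumes that the ReID features of the $N$ (or $M$) frames are averaged into one vector per player, that $|\cdot|$ is the Euclidean norm of the feature difference, and that the STO branch is checked first because the two branches of Eq. (4) can both hold; none of these choices is stated in the text, and all function names are hypothetical.

```python
import math


def mean_feature(features):
    """Average a list of equally sized feature vectors into one vector."""
    n, dim = len(features), len(features[0])
    return [sum(f[i] for f in features) / n for i in range(dim)]


def feature_distance(u, v):
    """Euclidean norm of the difference between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def classify_occlusion(a_before, b_before, a_after, b_after, gamma):
    """Return "STO" or "DTO" following one reading of Eq. (4).

    Each argument is a list of per-frame ReID feature vectors.
    """
    d_before = feature_distance(mean_feature(a_before), mean_feature(b_before))
    d_after = feature_distance(mean_feature(a_after), mean_feature(b_after))
    # Assumption: STO is tested first, so DTO applies only when both distances are large.
    return "STO" if d_before < gamma or d_after < gamma else "DTO"
```

With this reading, two teammates whose features stay close before or after the occlusion are routed to the motion-based check, and opponents to the appearance-based check.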

Since we record each player’s position, ReID features, and occlusion state, we can check, $M$ frames after the occlusion, whether the IDs of the two players have been correctly assigned. For DTO, the different team uniforms lead to lower appearance similarity, so we compare the ReID feature similarity of the two trajectories before and after occlusion. If the ReID feature of one trajectory after occlusion is similar to that of the other trajectory before occlusion, it is considered that an ID switch has occurred, and the formula is as follows:

$$
\begin{cases}\left|R_{a\_N}^{before}-R_{b\_M}^{after}\right|<\delta\\ \left|R_{a\_N}^{after}-R_{b\_M}^{before}\right|<\delta\end{cases} \tag{5}
$$

where $R_{a\_N}^{before}$ denotes the ReID feature of $N$ frames before the occlusion of player $a$, $R_{b\_M}^{after}$ denotes the ReID feature of $M$ frames after the occlusion of player $b$, and $\delta$ denotes the threshold of appearance similarity of two players. Because two players typically do not change much in their appearance features before and after occlusion occurs, satisfying Eq. (5) means that the IDs of these two players have been wrongly assigned after the occlusion and need to be exchanged.
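The DTO check of Eq. (5) can be sketched as follows, assuming each argument is a single ReID vector already aggregated over the relevant frames, that $|\cdot|$ is the Euclidean norm of the difference, and that both conditions in the brace must hold together. The function names are hypothetical.

```python
import math


def feature_distance(u, v):
    """Euclidean norm of the difference between two ReID feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def dto_id_switch(a_before, a_after, b_before, b_after, delta):
    """Return True if trajectories a and b appear to have swapped IDs (Eq. 5).

    A swap is flagged when a's pre-occlusion appearance matches b's
    post-occlusion appearance and vice versa.
    """
    cross_ab = feature_distance(a_before, b_after)
    cross_ba = feature_distance(a_after, b_before)
    return cross_ab < delta and cross_ba < delta
```

When the function returns `True`, the two tracking IDs would be exchanged from the end of the occlusion onward.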

For STO, the same team uniforms will have a higher degree of appearance similarity, so we compare the moving distance and speed of two trajectories before and after occlusion. If the moving distance and velocity of one trajectory after occlusion are similar to the other one before occlusion, it is considered that an ID switch has occurred. The formulae are as follows:

$$
\begin{cases}\left|V_{a\_N}^{before}-V_{a\_M}^{after}\right|>\varepsilon\\ \left|V_{b\_N}^{before}-V_{b\_M}^{after}\right|>\varepsilon\\ \left|(P_{C}^{occ}-P_{a\_N}^{before})\right|-\left|(P_{a\_M}^{after}-P_{C}^{occ})\right|>\zeta\\ \left|(P_{C}^{occ}-P_{b\_N}^{before})\right|-\left|(P_{b\_M}^{after}-P_{C}^{occ})\right|>\zeta\end{cases} \tag{6}
$$

where $V_{a\_N}^{before}=\left|P_{C}^{occ}-P_{a\_N}^{before}\right|/\left|F_{C}-F_{N}\right|$ and $V_{a\_M}^{after}=\left|P_{a\_M}^{after}-P_{C}^{occ}\right|/\left|F_{M}-F_{C}\right|$. $P_{C}^{occ}$ denotes the single player position that can be detected during occlusion, $F_{C}-F_{N}$ denotes the number of frames from $N$ to $C$, and $F_{M}-F_{C}$ denotes the number of frames from $C$ to $M$. $V_{a\_N}^{before}$ and $V_{a\_M}^{after}$ denote the velocity of player $a$ at $N$ frames before occlusion and at $M$ frames after occlusion, respectively; the velocities of player $b$ are defined in the same way. $V$ denotes the velocity in cm/frame, and $P$ denotes the position. $\varepsilon$ and $\zeta$ represent the thresholds for the velocity and position of two players, respectively. Since both players do not typically make large changes in velocity and position at the same time when the occlusion occurs, satisfying Eq. (6) means that the IDs of these two players may have been wrongly assigned after the occlusion and need to be exchanged.
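A literal sketch of the STO check in Eq. (6) follows. It assumes positions are `(x, y)` court coordinates in cm, that `n` and `m` stand for $|F_{C}-F_{N}|$ and $|F_{M}-F_{C}|$, and that all four inequalities must hold together; the position terms are implemented exactly as written, as a signed difference of two distances. The function names are hypothetical.

```python
import math


def _dist(p, q):
    """Euclidean distance between two court positions."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def sto_id_switch(p_occ, a_before, a_after, b_before, b_after, n, m, eps, zeta):
    """Return True if all four conditions of Eq. (6) hold for players a and b.

    p_occ:  the single position detected during the occlusion (frame C).
    n, m:   frame gaps |F_C - F_N| and |F_M - F_C| (both positive).
    """
    checks = []
    for before, after in ((a_before, a_after), (b_before, b_after)):
        d_before = _dist(p_occ, before)   # distance travelled into the occlusion
        d_after = _dist(after, p_occ)     # distance travelled out of the occlusion
        v_before = d_before / n           # cm/frame before the occlusion
        v_after = d_after / m             # cm/frame after the occlusion
        checks.append(abs(v_before - v_after) > eps)
        checks.append(d_before - d_after > zeta)
    return all(checks)
```

A `True` result marks the pair as a candidate for exchanging IDs; a player who moves through the occlusion at a steady speed fails the velocity condition and is left unchanged.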

4 Experiments

4.1 Experimental setup

Datasets. We used the basketball fixed-camera videos from the TeamTrack dataset [17] to validate our approach because, among various team-sport MOT datasets [15, 16, 11, 17], it is considered to include frequent CMOO problems. Here, we briefly introduce this basketball fixed video dataset, which was filmed from a side view using a fish-eye camera. The published videos were calibrated in advance and segmented into 30-second clips. We split the dataset into training and test videos with 162,353 and 170,529 frames, respectively. Players and referees are labeled in the dataset, which allows our detector to detect only the players and filter out the referees.

Metrics. Multiple object tracking accuracy (MOTA) [29] has often been used as an evaluation metric for MOT tasks. However, MOTA emphasizes detection performance over association accuracy. In basketball, the main purpose of tracking is to obtain the players' trajectories, so we use HOTA [30] as the primary evaluation metric. HOTA combines detection accuracy (DetA), localization accuracy (LocA), and association accuracy (AssA), and therefore reflects both detection and association performance more comprehensively. In addition, false positives (FP), false negatives (FN), and ID switches (IDS) also reflect tracking performance well. Therefore, we use HOTA, DetA, LocA, AssA, FP, FN, and IDS as the tracking evaluation metrics.
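To make the difference between MOTA and HOTA concrete, the sketch below follows the standard definitions from the cited references [29, 30]: MOTA subtracts the summed errors (FN, FP, IDS) from 1 relative to the number of ground-truth boxes, while HOTA at a single localization threshold is the geometric mean of DetA and AssA (the final HOTA averages this value over a range of localization thresholds). The function names and all counts are hypothetical and are not taken from the experiments in this paper.

```python
import math


def mota(fn, fp, ids, num_gt):
    """CLEAR-MOT accuracy: 1 - (FN + FP + IDS) / number of ground-truth boxes."""
    return 1.0 - (fn + fp + ids) / num_gt


def det_a(tp, fn, fp):
    """Detection accuracy at one localization threshold."""
    return tp / (tp + fn + fp)


def hota_alpha(det_a_value, ass_a_value):
    """HOTA at one localization threshold: geometric mean of DetA and AssA."""
    return math.sqrt(det_a_value * ass_a_value)


# Hypothetical counts, used only to exercise the formulas.
example_mota = mota(fn=50, fp=30, ids=20, num_gt=1000)
example_det_a = det_a(tp=800, fn=100, fp=100)
example_hota = hota_alpha(example_det_a, 0.45)
```

ID switches enter MOTA only as one additive error term next to FN and FP, whereas AssA enters HOTA multiplicatively, which is why HOTA responds more strongly to association quality.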

Detector and ReID. For our method and the baselines, we chose YOLOv8 [31] as the object detector to achieve real-time, high-accuracy detection. We fine-tuned the officially provided pretrained yolov8l model on the training dataset using an SGD optimizer with a weight decay of 0.0005 and a momentum of 0.9. The initial learning rate was 0.001, and the model was trained for 200 epochs. For player ReID, we use the FastReID model [32]; because we use ReID mainly to solve the occlusion problem, the ReID model does not need to have particularly high performance. All experiments were conducted on a single Nvidia RTX 4080 GPU.
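The reported detector settings could be expressed roughly as follows, assuming the Ultralytics Python package [31] is used for training. This is only a sketch: the dataset description file name is hypothetical, and the authors' actual training script may differ.

```python
def build_train_kwargs(data_yaml):
    """Collect the training settings reported in the text as keyword arguments."""
    return {
        "data": data_yaml,        # dataset description file (hypothetical path)
        "epochs": 200,
        "optimizer": "SGD",
        "lr0": 0.001,             # initial learning rate
        "momentum": 0.9,
        "weight_decay": 0.0005,
    }


def train_detector(data_yaml="teamtrack_basketball.yaml"):
    """Fine-tune pretrained YOLOv8-L weights with the settings above."""
    # Imported lazily so the settings can be inspected without the package installed.
    from ultralytics import YOLO

    model = YOLO("yolov8l.pt")    # officially provided pretrained weights
    return model.train(**build_train_kwargs(data_yaml))
```

Calling `train_detector()` starts training, whereas `build_train_kwargs(...)` only returns the hyperparameters, so they can be checked or logged separately.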

Implementation details. The threshold for a detection to be treated as a high-score detection is 0.6. Detections with a confidence score between 0.1 and 0.6 are treated as low-score detections, and those with a confidence score lower than 0.1 are filtered out. We performed hyperparameter optimization on our method and selected the best thresholds. The RLLI thresholds $\alpha$ and $\beta$ were set to 0.2 and 260, respectively; the parameter $\gamma$ for distinguishing between STO and DTO was set to 0.2; the DTO appearance similarity threshold $\delta$ was set to 0.2; and the DTO velocity similarity threshold $\varepsilon$ and position threshold $\zeta$ were both set to 3.
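The confidence-based split of detections described above can be sketched as follows. The function and constant names are illustrative, and the handling of scores exactly equal to 0.6 or 0.1 is an assumption, since the text does not specify the boundary cases.

```python
HIGH_SCORE_THRESHOLD = 0.6
LOW_SCORE_THRESHOLD = 0.1


def split_detections(detections):
    """Split (box, score) pairs into high-score and low-score groups.

    Scores below LOW_SCORE_THRESHOLD are dropped. Treating a score of exactly
    0.6 as high and exactly 0.1 as low is an assumption of this sketch.
    """
    high, low = [], []
    for box, score in detections:
        if score >= HIGH_SCORE_THRESHOLD:
            high.append((box, score))
        elif score >= LOW_SCORE_THRESHOLD:
            low.append((box, score))
    return high, low


# Hypothetical detections: (x1, y1, x2, y2) boxes with confidence scores.
detections = [
    ((10, 20, 50, 120), 0.92),
    ((60, 25, 95, 130), 0.35),
    ((200, 40, 230, 110), 0.05),
]
high, low = split_detections(detections)
```

Here the first detection falls into the high-score group, the second into the low-score group, and the third is filtered out.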

4.2 Benchmark Results

Among current MOT algorithms, BoT-SORT [8] demonstrates the most representative performance on large-scale pedestrian datasets such as MOT20 [2], while Deep-EIoU [20] achieves the best results on sports scene datasets such as SportsMOT [11]. Therefore, we compare our method with BoT-SORT and Deep-EIoU on the basketball fixed video dataset. The results are shown in Table 1, and a qualitative example is depicted in Figure 4. We employed the same detector, YOLOv8, for all methods. DetA and LocA were quite similar across methods, differing by less than 1 percentage point, so we omit the DetA and LocA results in the subsequent sections.

The results demonstrate that our method achieves the best HOTA score owing to a significant improvement in AssA. By using projected positions to associate player trajectories, our approach can correctly track players whose detection bounding boxes overlap but whose projected positions are far apart. Since this situation occurs frequently, it leads to a substantial reduction in IDS and a notable boost in AssA.

When we employ BGR, we neither discard trajectories nor create new ones. Not discarding trajectories prevents some lost player trajectories from reacquiring their correct IDs, resulting in an increase in false negatives (FN); we also show this in the ablation study below. On the other hand, not creating new trajectories means that some detection bounding boxes of referees and spectators are not recognized as player trajectories, thereby reducing false positives (FP). In addition, our method yields only 10 player trajectories, which is significantly better than the other methods. Overall, our method outperforms the compared trackers in HOTA, AssA, FP, and IDS while keeping the tracking process online, showing its effectiveness for MOT on the basketball fixed video dataset.

Method HOTA (%) AssA (%) FN FP IDS
BoT-SORT 59.70±6.43 53.43±10.05 270.8±163.3 492.6±369.9 23.2±13.7
Deep-EIoU 60.74±7.72 55.47±11.62 265.0±162.4 486.2±368.1 20.7±10.3
Basketball-SORT (ours) 63.35±7.21 59.15±10.61 387.7±464.9 118.9±191.8 9.3±5.0
Table 1: Performance comparison among different MOT algorithms on the basketball fixed video test set. Our algorithm outperforms the other tracking algorithms in several major evaluation metrics. BoT-SORT and Deep-EIoU are evaluated using their official code [8, 20].
Figure 4: (a-c) show the results of the Deep-EIoU method: during occlusion, the player ID is lost, and the method fails to assign the player ID correctly. (d-f) show the results of Basketball-SORT, which successfully matches the player ID during complete occlusion.
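The observation in this section that two players can have overlapping detection boxes while their projected positions are far apart can be illustrated with a small sketch. All box coordinates and court positions below are hypothetical, and the projection from image to court coordinates is assumed to have been done already.

```python
import math


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical image-space boxes (pixels) of two players seen from the side.
box_first = (100, 100, 160, 260)
box_second = (120, 80, 170, 220)

# Hypothetical projected court positions of the same two players (cm).
court_first = (1400.0, 300.0)
court_second = (1450.0, 900.0)

overlap = iou(box_first, box_second)
court_distance = math.dist(court_first, court_second)
```

With these numbers the boxes overlap substantially in the image (IoU of about 0.41), yet the players are roughly 6 m apart on the court, so a distance-based association in court coordinates can keep them separate where an IoU-based one may confuse them.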

4.3 Ablation Studies

In this experiment, we evaluated Basketball-SORT on the test set of the basketball fixed camera dataset under different settings, namely whether BGR and RLLI are used and whether STO and DTO are employed to address occlusion issues ("mix" indicates using both STO and DTO simultaneously, and "Full model" denotes the complete Basketball-SORT). In addition, the "Projected" method, described in Section 3.1, uses only the projected positions to associate player trajectories. The results are shown in Table 2. After adding BGR, FP decreases dramatically while FN increases. The large FP without BGR may be due to insufficient training data, which causes some spectators to be recognized as players; the reasons for the increase in FN and the decrease in FP caused by BGR are the same as those discussed in Section 4.2. STO and DTO lead to an increase in IDS because our approach resolves occlusion by identifying it after it occurs and then reassigning the correct ID to the player; this reassignment is itself counted as an additional ID switch, increasing the overall IDS count. RLLI, STO, and DTO have all improved the HOTA, indicating that occlusion problems occur frequently in this sports scene. The STO + RLLI setting has the highest HOTA, probably because more of the occlusions in the dataset are STO and some DTO cases are likely to be misclassified as STO.

Method HOTA (%) AssA (%) FN FP IDS
Projected 61.96±6.99 57.32±10.07 239.6±142.7 532.2±409.7 20.1±10.3
BGR 63.12±7.55 59.23±9.89 491.2±491.1 110.9±162.0 8.6±5.2
RLLI 62.35±7.02 58.50±9.59 500.2±425.9 156.5±194.4 8.9±5.1
STO 63.45±7.74 59.57±10.67 442.9±527.8 105.6±151.4 8.9±5.1
STO + RLLI 63.48±7.10 59.37±10.29 386.6±465 117.5±192.5 9.0±4.8
DTO 63.06±7.34 59.11±9.97 492.9±528.0 106.9±161.0 9.6±5.4
DTO + RLLI 63.18±6.75 58.74±9.80 386.5±464.1 117.7±193.3 9.4±5.2
mix 63.21±7.84 59.91±11.01 492.7±527.2 105.1±161.0 9.7±5.5
Full model 63.35±7.21 59.15±10.61 387.7±464.9 118.9±191.8 9.3±5.0
Table 2: Evaluation of the Basketball-SORT algorithm with different settings on the basketball video test set, including the use of the Basketball Game Restriction (BGR), Reacquiring Long-Lost IDs (RLLI), same team occlusion (STO), and different team occlusion (DTO) components of our method.

4.4 Robustness to different movement velocities

In basketball, the movement speed of players varies dynamically and is unpredictable. When a player is occluded, the KF still predicts the player's possible position in the next frame using the match threshold. However, because the player's velocity is unstable, a fixed association threshold may leave the player unable to rematch with a newly appeared detection bounding box after the occlusion. To address this issue, we introduce the RLLI reacquire threshold $Dist_{lost\_re}$, as described in Section 3.2, which is applied to the distance between the position of the Long-Lost trajectory and the reappeared detection bounding box. It effectively reduces the probability that a Long-Lost player's trajectory cannot be re-matched with the new detection bounding box. To examine the robustness of our method, we use the RLLI model in Table 2 to test the effect of different match thresholds and $Dist_{lost\_re}$ values on the results, as shown in Table 3. The match threshold represents the change in position due to velocity variations between adjacent frames. It is usually below 200 cm, but in fast-break situations it may reach up to 300 cm. Therefore, we chose this range to test our method. $Dist_{lost\_re}$ represents the distance between a Long-Lost player's trajectory and the reappeared detection bounding box within $B$ frames.
We set this parameter to a range of 170-250 cm, as it corresponds to the most likely distance a Long-Lost player might move within the given $B$ frames. Since we defined the court dimensions as 2800 cm × 1400 cm, the match threshold and $Dist_{lost\_re}$ are expressed in centimeters (cm). The results show that different parameter settings give similar results, with an average HOTA of 61.81%. This indicates that our method remains effective in real-world settings, where ground truth is often unavailable and the tracking parameters cannot be tuned.

Match threshold in KF (cm): 200 220 240 260 280 300
$Dist_{lost\_re}$ = 150 cm: 61.62 61.54 61.69 61.88 61.48 61.31
$Dist_{lost\_re}$ = 170 cm: 61.89 61.71 61.87 61.90 61.34 61.48
$Dist_{lost\_re}$ = 190 cm: 61.88 61.71 61.94 61.99 61.33 61.51
$Dist_{lost\_re}$ = 210 cm: 61.84 61.93 61.90 62.34 61.75 61.48
$Dist_{lost\_re}$ = 230 cm: 61.90 61.96 61.93 62.34 61.76 61.48
$Dist_{lost\_re}$ = 250 cm: 61.91 61.96 61.93 62.35 61.78 61.48
Table 3: HOTA (%) for different match thresholds and RLLI reacquire thresholds on the basketball fixed video test set. Each row corresponds to a different $Dist_{lost\_re}$, and each column corresponds to a different match threshold in KF. The various parameter configurations yield similar HOTA scores, indicating the robustness of the tracking method across different settings.
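The role of $Dist_{lost\_re}$ described in this section can be sketched as a simple distance gate in court coordinates. The function name, the data layout, and the choice of the nearest candidate are assumptions of this sketch; the additional conditions of RLLI from Section 3.2, such as the limit of $B$ frames, are omitted.

```python
import math


def reacquire_long_lost(lost_tracks, detection_pos, dist_lost_re):
    """Return the ID of the closest long-lost track within dist_lost_re, else None.

    lost_tracks maps track ID -> last known court position (x, y) in cm.
    Picking the nearest candidate is an assumption of this sketch.
    """
    best_id, best_dist = None, dist_lost_re
    for track_id, last_pos in lost_tracks.items():
        d = math.dist(last_pos, detection_pos)
        if d <= best_dist:
            best_id, best_dist = track_id, d
    return best_id


# Hypothetical long-lost tracks and reappeared detections (court coordinates, cm).
lost = {3: (1200.0, 700.0), 7: (2000.0, 300.0)}
matched = reacquire_long_lost(lost, (1300.0, 760.0), dist_lost_re=210)
unmatched = reacquire_long_lost(lost, (600.0, 100.0), dist_lost_re=210)
```

In this example the first detection lies about 117 cm from track 3 and is reacquired, while the second is farther than 210 cm from every long-lost track and stays unmatched.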

5 Conclusion

In this paper, we proposed Basketball-SORT, an online and robust MOT approach that addresses the CMOO problems in basketball videos. To overcome the CMOO problem, we used the trajectories of neighboring frames based on the projected positions of the players. Our method introduces BGR and RLLI based on the characteristics of basketball scenes, and it also resolves occlusions based on player trajectories and appearance features. Experimental results show that our approach handles the CMOO problem effectively and achieves a higher HOTA than the compared tracking algorithms. For future work, tracking basketball games filmed with moving cameras and addressing more complex player occlusions are needed in order to track athletes' motion trajectories in more general settings.

Acknowledgments

This work was financially supported by JSPS Grant Numbers 20H04075 and 23H03282 and JST PRESTO Grant Number JPMJPR20CA.

Declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Data availability statements

The evaluation dataset in this study is available at https://github.com/AtomScott/TeamTrack.

Compliance with Ethical Standards

The participants were fully informed about the study, and their consent was obtained in advance. All the experimental procedures were performed after obtaining prior approval from the ethical committee at Tokai University.

References

  • Geiger et al. [2012] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361 (2012). IEEE
  • Dendorfer et al. [2020] Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., Leal-Taixé, L.: Mot20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003 (2020)
  • Milan et al. [2016] Milan, A., Leal-Taixé, L., Reid, I., Roth, S., Schindler, K.: Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831 (2016)
  • Wang et al. [2022] Wang, J., Peng, Y., Yang, X., Wang, T., Zhang, Y.: Sportstrack: an innovative method for tracking athletes in sports scenes. arXiv preprint arXiv:2211.07173 (2022)
  • Sun et al. [2022] Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., Luo, P.: Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20993–21002 (2022)
  • Bewley et al. [2016] Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468 (2016). IEEE
  • Peng et al. [2020] Peng, J., Wang, T., Lin, W., Wang, J., See, J., Wen, S., Ding, E.: Tpm: Multiple object tracking with tracklet-plane matching. Pattern Recognition 107, 107480 (2020)
  • Aharon et al. [2022] Aharon, N., Orfaig, R., Bobrovsky, B.-Z.: Bot-sort: Robust associations multi-pedestrian tracking. arXiv preprint arXiv:2206.14651 (2022)
  • Liang et al. [2022] Liang, C., Zhang, Z., Zhou, X., Li, B., Zhu, S., Hu, W.: Rethinking the competition between detection and reid in multiobject tracking. IEEE Transactions on Image Processing 31, 3182–3196 (2022)
  • Zhang et al. [2022] Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.: Bytetrack: Multi-object tracking by associating every detection box. In: European Conference on Computer Vision, pp. 1–21 (2022). Springer
  • Cui et al. [2023] Cui, Y., Zeng, C., Zhao, X., Yang, Y., Wu, G., Wang, L.: Sportsmot: A large multi-object tracking dataset in multiple sports scenes. arXiv preprint arXiv:2304.05170 (2023)
  • Zhang et al. [2021] Zhang, Y., Wang, C., Wang, X., Zeng, W., Liu, W.: Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision 129, 3069–3087 (2021)
  • Wojke et al. [2017] Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649 (2017). IEEE
  • Cao et al. [2023] Cao, J., Pang, J., Weng, X., Khirodkar, R., Kitani, K.: Observation-centric sort: Rethinking sort for robust multi-object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9686–9696 (2023)
  • Cioppa et al. [2022] Cioppa, A., Giancola, S., Deliege, A., Kang, L., Zhou, X., Cheng, Z., Ghanem, B., Van Droogenbroeck, M.: Soccernet-tracking: Multiple object tracking dataset and benchmark in soccer videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3491–3502 (2022)
  • Scott et al. [2022] Scott, A., Uchida, I., Onishi, M., Kameda, Y., Fukui, K., Fujii, K.: Soccertrack: A dataset and tracking algorithm for soccer with fish-eye and drone videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3569–3579 (2022)
  • Scott et al. [2024] Scott, A., Uchida, I., Ding, N., Umemoto, R., Bunker, R., Kobayashi, R., Koyama, T., Onishi, M., Kameda, Y., Fujii, K.: Teamtrack: A dataset for multi-sport multi-object tracking in full-pitch videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  • Kalman et al. [1960] Kalman, R.E., et al.: Contributions to the theory of optimal control. Boletin Sociedad Matematica Mexicana 5(2), 102–119 (1960)
  • Yang et al. [2023] Yang, F., Odashima, S., Masui, S., Jiang, S.: Hard to track objects with irregular motions and similar appearances? make it easier by buffering the matching space. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 4799–4808 (2023)
  • Huang et al. [2024] Huang, H.-W., Yang, C.-Y., Sun, J., Kim, P.-K., Kim, K.-J., Lee, K., Huang, C.-I., Hwang, J.-N.: Iterative scale-up expansioniou and deep features association for multi-object tracking in sports. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 163–172 (2024)
  • Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • Cao et al. [2022] Cao, J., Wu, H., Kitani, K.: Track targets by dense spatio-temporal position encoding. arXiv preprint arXiv:2210.09455 (2022)
  • Zeng et al. [2022] Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., Wei, Y.: Motr: End-to-end multiple-object tracking with transformer. In: European Conference on Computer Vision, pp. 659–675 (2022). Springer
  • Sun et al. [2020] Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., Wang, C., Luo, P.: Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460 (2020)
  • Vats et al. [2023] Vats, K., Walters, P., Fani, M., Clausi, D.A., Zelek, J.S.: Player tracking and identification in ice hockey. Expert Systems with Applications 213, 119250 (2023)
  • Maglo et al. [2022] Maglo, A., Orcesi, A., Pham, Q.-C.: Efficient tracking of team sport players with few game-specific annotations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3461–3471 (2022)
  • Sangüesa et al. [2019] Sangüesa, A.A., Ballester, C., Haro, G.: Single-camera basketball tracker through pose and semantic feature fusion. CoRR, abs/1906.02042 (2019)
  • Huang et al. [2023] Huang, H.-W., Yang, C.-Y., Ramkumar, S., Huang, C.-I., Hwang, J.-N., Kim, P.-K., Lee, K., Kim, K.: Observation centric and central distance recovery for athlete tracking. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 454–460 (2023)
  • Bernardin and Stiefelhagen [2008] Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing 2008, 1–10 (2008)
  • Luiten et al. [2021] Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., Leibe, B.: Hota: A higher order metric for evaluating multi-object tracking. International journal of computer vision 129, 548–578 (2021)
  • Jocher et al. [2023] Jocher, G., Chaurasia, A., Qiu, J.: YOLO by Ultralytics (2023). https://github.com/ultralytics/ultralytics
  • He et al. [2023] He, L., Liao, X., Liu, W., Liu, X., Cheng, P., Mei, T.: Fastreid: A pytorch toolbox for general instance re-identification. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 9664–9667 (2023)