StreamMOTP: Streaming and Unified Framework for Joint 3D Multi-Object Tracking and Trajectory Prediction

Jiaheng Zhuang1, Guoan Wang2, Siyu Zhang2, Xiyang Wang2,
Hangning Zhou2, Ziyao Xu2, Chi Zhang2, Zhiheng Li1
1 Tsinghua University, China. 2 Mach Drive, China. Email: [email protected]
Abstract

3D multi-object tracking and trajectory prediction are two crucial modules in autonomous driving systems. Traditional paradigms handle the two tasks separately, and only a few recent methods have started to model them jointly. However, these approaches suffer from the limitations of single-frame training and from inconsistent coordinate representations between the tracking and prediction tasks. In this paper, we propose a streaming and unified framework for joint 3D Multi-Object Tracking and trajectory Prediction (StreamMOTP) to address these challenges. Firstly, we construct the model in a streaming manner and exploit a memory bank to preserve and leverage long-term latent features of tracked objects more effectively. Secondly, we introduce a relative spatio-temporal positional encoding strategy to bridge the gap in coordinate representations between the two tasks and to maintain pose-invariance for trajectory prediction. Thirdly, we further improve the quality and consistency of predicted trajectories with a dual-stream predictor. We conduct extensive experiments on the popular nuScenes dataset, and the results demonstrate the effectiveness of StreamMOTP, which significantly outperforms previous methods on both tasks. Furthermore, we show that the proposed framework has great potential and advantages in practical autonomous driving applications.

Figure 1: Different pipelines for the tasks of multi-object tracking and trajectory prediction in autonomous driving. (a) Cascade paradigm, where the two tasks are performed separately with non-differentiable transitions. (b) Joint single-frame paradigm, where the two tasks are performed jointly in a parallelized framework per frame. (c) The proposed StreamMOTP, where the memory, feature, and gradient are propagated across consecutive frames to enhance the long-term modeling ability and temporal consistency.

I INTRODUCTION

In autonomous driving systems, 3D Multi-Object Tracking (MOT) [1, 2, 3, 4, 5, 6] and trajectory prediction [7, 8, 9, 10, 11, 12, 13, 14] are two crucial tasks that play a vital role in ensuring the driving performance of the ego-vehicle. High-precision tracking provides a more solid foundation for prediction and, in turn, accurate predictions can enhance the effectiveness of tracking. As depicted in Fig.1 (a), the two tasks are executed one after another in current mainstream autonomous driving pipelines. Although this paradigm has achieved some success, the separated processing flow cannot fully exploit the potential complementarity between tracking and prediction, since it suffers from information loss, feature misalignment, and error accumulation across modules [15]. Although some methods [16, 17, 18] attempt to integrate the two tasks as shown in Fig.1 (b), several limitations have not yet been well explored: (1) Multi-object tracking and trajectory prediction are both executed in a streaming manner in actual deployments, while most previous methods are trained in a snap-shot pattern, where the length of the historical window is fixed and long-term information cannot be exploited efficiently. (2) The coordinate representations of objects for tracking and prediction generally differ: MOT needs a unified coordinate system for optimal association, while most prediction methods adopt an agent-centric coordinate representation for each object to ensure pose-invariance. (3) Most methods focus on predicting the future trajectories of objects visible in the current frame and overlook those lost because of occlusions or missed detections from upstream perception, which may adversely affect downstream tasks.

In this paper, we introduce StreamMOTP, a streaming framework for joint multi-object tracking and trajectory prediction as depicted in Fig.1 (c), where the tasks of MOT and trajectory prediction are jointly performed on successive frames. Specifically, we associate the newly perceived objects with historical tracklets and predict their future trajectories simultaneously. Different from previous works, the extracted latent features of objects are sequentially utilized in StreamMOTP as part of the representation for the subsequent tracked objects during the forward propagation phase. As for the back-propagation, the gradients are not confined to a single frame but are propagated through multiple frames, which greatly narrows the gap between training and online inference, allowing for a more comprehensive learning process by accounting for temporal dependencies across the entire sequence.

Concretely, we extend the training pattern from single-frame to multi-frame and introduce a memory bank to maintain and update long-term latent features for tracked objects, thereby improving the model’s capability for long-term sequence modeling. To address the coordinate system discrepancy between tracking and prediction, we propose a relative Spatio-Temporal Positional Encoding (STPE) strategy, which reconciles and unifies the agent-centric and ego-centric representations used in the two tasks. At the same time, based on the observation that the predicted trajectories of an object in consecutive frames overlap substantially, as depicted in Fig.1 (c-left), we apply a dual-stream predictor to generate future trajectories for both tracked and newly perceived objects simultaneously, which benefits both MOT and trajectory prediction.

With the design of the streaming and unified framework, StreamMOTP is well suited to handling more complex driving scenarios in actual applications. On the one hand, the predicted trajectories for tracked objects can help deal with occlusions at the current moment by marking the possible positions of obscured targets in the current frame, as shown in Fig.1 (c-middle). On the other hand, for objects newly perceived in the current frame, StreamMOTP can still predict their future trajectories by leveraging social interactions and contextual features stored in the memory bank, whereas traditional prediction methods may fail due to the lack of historical information about them, as shown in Fig.1 (c-right).

The core contributions are summarized as follows:

  • We propose StreamMOTP, a joint MOT and trajectory Prediction model based on a streaming framework to bridge the gap between training and actual deployment. A memory bank for tracked objects is introduced in this framework for utilizing long-term features more effectively.

  • We introduce a spatio-temporal positional encoding strategy to construct the relative relationship between objects in different frames, which reconciles and unifies the inconsistent coordinate representations used in tracking and prediction.

  • We design a dual-stream predictor to simultaneously predict the trajectories of objects in both the current and previous frames. The predicted trajectory from the previous frame can further assist in predicting newly perceived objects’ trajectories, which achieves better temporal consistency in trajectory prediction.

  • We achieve better performance for both MOT and trajectory prediction on nuScenes, improving AMOTA by 3.84% and reducing minADE / minFDE by 0.220 / 0.141.

II RELATED WORK

II-A 3D Multi-Object Tracking

Existing multi-object tracking paradigms, such as tracking-by-detection (DeepSORT [19], AB3DMOT [3]), joint detection and embedding learning (FairMOT [20], JDE [21]), and joint detection and tracking (Tracktor++ [22], YONDTMOT [23]), typically rely on Kalman filters (KFs) to predict the positions of tracked objects for better association. Yet, KFs require fine-tuning of parameters and struggle with occlusions (PC3TMOT [24], DeepFusionMOT [25]). In contrast, dedicated prediction models can provide superior short-term prediction results for tracking, especially in complex scenarios such as occlusions. Therefore, combining multi-object tracking with trajectory prediction can effectively improve overall tracking performance. This combination not only reduces the dependence on traditional methods like KFs but also enhances the robustness and adaptability of the tracking methods.

II-B Trajectory Prediction

There has been significant progress in trajectory prediction recently. With the use of pooling [7], graph convolution [9], and attention mechanisms [13, 14], vector-based methods [10] can efficiently aggregate sparse information in traffic scenes. As the future is uncertain, some works (Multipath++ [12], HiVT [26]) predict a multimodal future distribution by decoding a set of trajectories from the scene context, while others (DenseTNT [11]) generate multimodal predictions by leveraging anchors. Though these methods greatly improve trajectory prediction, most of them use ground-truth (GT) past trajectories as input for training and testing, neglecting the tracking errors that accumulate with imperfect inputs. Therefore, we handle tracking and prediction jointly, without requiring GT trajectories as the predictor’s input, to provide more robust predictions based on practical real-world detectors.

Figure 2: Overview of StreamMOTP. Tracklets and proposals denote the previous frame trajectories and the current frame detections respectively. The model first performs Attentional Spatio-Temporal Interaction, which is based on attention with STPE, to get context features. The tasks of tracking and prediction are then performed based on those context features. Memories with up-to-date context features and tracking results are updated at each time step.

II-C Joint Tracking and Prediction

In the last couple of years, there has been growing interest in joint tracking and prediction. For example, [27] refines the inputs for the predictor through a re-tracking module, and MTP [15] proposes multi-hypothesis data association to generate multiple sets of tracks for the predictor simultaneously. Besides polishing the input tracklets for the prediction module, some studies combine tracking and prediction through joint optimization. PTP [16] and PnPNet [28] use a shared feature representation to address both tasks. AffinPred [17] and TTFD [18] use affinity matrices rather than tracklets as inputs to the prediction module to improve forecasting performance, but they sacrifice the capability to provide explicit tracking results. However, almost all of these methods operate in a snap-shot form and neglect the misalignment between tracking and prediction. In contrast, our method uses a streaming framework and a unified spatio-temporal positional encoding to address these problems.

III APPROACH

III-A Streaming Framework

Let $\mathcal{D}=\{d_{1},\ldots,d_{N}\}$ represent the set of objects perceived in the current frame by a 3D object detector, where $N$ denotes the number of objects. Concretely, each object at frame $t$ is represented as $d_{i}^{t}=[d_{i}^{\text{pos},t},d_{i}^{\text{size},t},d_{i}^{\text{head},t},d_{i}^{\text{class},t},d_{i}^{\text{score},t}]$, where the elements denote the position, size, heading angle, class and confidence score from the detection module, respectively.
In this paper, the goal of joint 3D multi-object tracking and trajectory prediction includes two parts: to obtain the association of multiple obstacles in adjacent frames by assigning a unique track ID to each object, and meanwhile to predict the future trajectories $\mathcal{F}=\{f_{1},\ldots,f_{N}\}$ for all agents in the current frame, with each element of a trajectory specified by a two-dimensional coordinate $(x,y)$.

Based on the observation that the physical world is continuous and that long-term history is essential for a safer autonomous driving system, we model joint 3D multi-object tracking and trajectory prediction in a streaming manner (as shown in Fig. 2). First, we extend the training pattern from single-frame to multi-frame to narrow the gap between training and actual deployment. More specifically, we introduce a Memory Bank that maintains long-term latent features of tracked objects across consecutive frames, so that long-term information can be used more effectively to benefit both multi-object tracking and trajectory prediction.

Specifically, the memory bank consists of $F\times N$ latent features, where $F$ is the length of the memory bank and $N$ is the number of objects stored per frame. At each time step, the latent features of the tracked objects that have been associated with newly perceived objects in the current frame are saved into the memory bank. These features are then utilized in subsequent frames to enhance the features of tracked objects, as detailed in Sec. III-B. Entries enter and leave the memory bank following the first-in, first-out rule.
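As an illustration of the first-in, first-out behaviour described above, the following minimal Python sketch stores per-frame latent features in a fixed-length queue. The class name `MemoryBank`, its methods, and the dictionary-per-frame layout are assumptions made for illustration; they are not taken from the StreamMOTP implementation.

```python
from collections import deque

import numpy as np


class MemoryBank:
    """FIFO store of per-frame latent features (hypothetical helper)."""

    def __init__(self, length: int):
        # A deque with maxlen drops the oldest frame automatically (first-in, first-out).
        self.frames = deque(maxlen=length)

    def push(self, track_ids, features):
        # Save the latent features of tracked objects matched in the current frame.
        self.frames.append(dict(zip(track_ids, features)))

    def read(self, track_id):
        # Collect the stored features of one track, oldest first.
        return [f[track_id] for f in self.frames if track_id in f]


bank = MemoryBank(length=3)                 # F = 3 frames of memory
for t in range(5):                          # stream five frames for track 7
    bank.push(track_ids=[7], features=[np.full(4, float(t))])
history = bank.read(7)                      # only the three most recent frames remain
```

Using a bounded queue keeps the memory cost constant over arbitrarily long sequences, which matches the streaming setting.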

III-B Spatio-Temporal Encoder

Feature Extraction. To capture the semantic and motion information of the obstacles in the driving scenario efficiently and adequately, we conduct feature extraction for the tracked and newly perceived objects separately. For the perceived objects from adjacent frames, we use $d\in\mathbb{R}^{N_{p}\times C}$ and $\tau\in\mathbb{R}^{N_{t}\times C}$ to represent the semantic features, where $N_{p}$ and $N_{t}$ denote the number of objects at the current frame $t$ (named proposals) and the previous frame $t-1$ (named tracklets), respectively. At the same time, the historical trajectories of the last $T_{h}$ frames for the $N_{t}$ tracked objects are represented with $H\in\mathbb{R}^{N_{t}\times T_{h}\times C}$. We deploy a Multi-Layer Perceptron (MLP) to encode the semantic information into high-dimensional features and fuse the historical data $H$ into the tracklets $\tau$ through Multi-Head Cross Attention (MHCA):

$$F_{d}=\operatorname{MLP}(d),\quad \tilde{F_{t}}=\operatorname{MLP}(\tau)+\operatorname{MHCA}(\operatorname{MLP}(H)) \tag{1}$$

where $\tilde{F_{t}}\in\mathbb{R}^{N_{t}\times D}$, $F_{p}\in\mathbb{R}^{N_{p}\times D}$, and $C$, $D$ correspond to the dimensions of the semantic and latent high-dimensional features, respectively.

Additionally, to equip our model with long-term temporal modeling capability, we exploit the latent features saved in the memory bank. Inspired by dynamic weight learning [29] [30], an ego transformation is applied to ensure the temporal alignment and effective feature usage across frames:

$$\begin{aligned}\alpha,\beta&=\operatorname{MLP}(E_{t}-E_{s})\\ M&=\alpha\operatorname{LN}(\tilde{M})+\beta\end{aligned} \tag{2}$$

where Eq.2 is an affine transformation and its parameters are derived from the ego difference between two frames. Then we apply temporal aggregation of long-term latent memory maintained in the memory bank for each tracked object with $\operatorname{MHCA}$ and then fuse the latent memory feature with the extracted feature of tracklets as follows:

$$F_{t}=\tilde{F_{t}}+\operatorname{MHCA}(M) \tag{3}$$
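To make the affine ego transformation of Eq. 2 concrete, the NumPy sketch below modulates layer-normalized memory features with a scale and shift computed from the ego-pose difference. It is only a sketch under stated assumptions: a single linear layer with random placeholder weights stands in for the MLP, the ego pose is assumed to be a 3-vector (x, y, yaw), and the names `ego_align` and `layer_norm` are hypothetical. The cross-attention aggregation of Eq. 3 is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
D, E = 8, 3            # feature width and ego-pose size (x, y, yaw); both assumed


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


# A single linear layer stands in for the MLP of Eq. 2; weights are random placeholders.
W = rng.normal(scale=0.1, size=(E, 2 * D))
b = np.concatenate([np.ones(D), np.zeros(D)])   # zero ego motion gives alpha = 1, beta = 0


def ego_align(M_tilde, ego_t, ego_s):
    """Return alpha * LN(M_tilde) + beta, with alpha and beta from the ego difference."""
    alpha, beta = np.split((ego_t - ego_s) @ W + b, 2)
    return alpha * layer_norm(M_tilde) + beta


M_tilde = rng.normal(size=(5, D))               # 5 stored features of one track
ego = np.array([1.0, 2.0, 0.1])
M_same = ego_align(M_tilde, ego, ego)           # no ego motion between the frames
M_moved = ego_align(M_tilde, ego + np.array([0.5, 0.0, 0.05]), ego)
```

With the bias chosen here, zero ego motion reduces the transformation to plain layer normalization, while any ego motion rescales and shifts the stored features.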

Spatio-Temporal Positional Encoding. For the task of tracking, aligning all features within a unified coordinate system is essential for feature association. In contrast, for prediction tasks, previous research [12][26] has demonstrated the advantages of agent-centric representations, which normalize trajectories to local coordinate systems centered on the selected agent. To bridge the gap in coordinate representation between tracking and prediction, we propose a relative Spatio-Temporal Positional Encoding (STPE) strategy. This approach differentiates between coordinate-independent and coordinate-dependent features, using the former as query tokens for the attention mechanism during feature interaction, while the latter are incorporated into attention through relative positional encoding.

To be specific, we encode the relative spatio-temporal position between object $i$ in the previous frame $t$ (tracklet frame) and object $j$ in the current frame $p$ (proposal frame) as follows:

$$\delta_{ij}^{tp}=\operatorname{MLP}([p_{j}^{p}-p_{i}^{t},\theta_{j}^{p}-\theta_{i}^{t}]) \tag{4}$$
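The input of the MLP in Eq. 4 can be built for all tracklet-proposal pairs with broadcasting, as in the NumPy sketch below. The function name `relative_inputs` and the toy values are assumptions for illustration; the MLP itself is not shown, and the heading difference is left unwrapped exactly as written in Eq. 4.

```python
import numpy as np


def relative_inputs(pos_t, head_t, pos_p, head_p):
    """Build the [p_j^p - p_i^t, theta_j^p - theta_i^t] input of Eq. 4 for all pairs.

    pos_t: (N_t, 2), head_t: (N_t,)  tracklet positions and headings
    pos_p: (N_p, 2), head_p: (N_p,)  proposal positions and headings
    Returns an (N_t, N_p, 3) array; an MLP (not shown) would map it to delta_ij.
    """
    d_pos = pos_p[None, :, :] - pos_t[:, None, :]     # (N_t, N_p, 2)
    d_head = head_p[None, :] - head_t[:, None]        # (N_t, N_p)
    return np.concatenate([d_pos, d_head[..., None]], axis=-1)


pos_t = np.array([[0.0, 0.0], [10.0, 5.0]])
head_t = np.array([0.0, 0.5])
pos_p = np.array([[1.0, 0.0], [10.5, 5.5], [30.0, -2.0]])
head_p = np.array([0.1, 0.5, 1.0])

rel = relative_inputs(pos_t, head_t, pos_p, head_p)

# Shifting every object by the same offset leaves the relative inputs unchanged.
shift = np.array([100.0, -50.0])
rel_shifted = relative_inputs(pos_t + shift, head_t, pos_p + shift, head_p)
```

Because only differences enter the encoding, the result does not depend on where the origin of the shared coordinate system is placed.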

Attentional Spatio-Temporal Interaction. Based on the relative embedding from the spatio-temporal positional encoding strategy, we fuse the features of proposals and tracklets with cross-attention and self-attention iteratively. Taking the proposal branch as an example, we use query-centric attention with the spatio-temporal positional encoding strategy, incorporating the relative positional embedding into the key/value of the attention mechanism:

$$\begin{aligned}F_{i}^{p\prime}&=\operatorname{MHCA}\left(\mathbf{Q}=F_{i}^{p},\ \mathbf{K/V}=\{F_{j}^{t}+\delta_{ij}^{tp}\}_{j\in N_{i}}\right)\\ F_{i}^{p\prime}&=\operatorname{MHSA}\left(\mathbf{Q}=F_{i}^{p},\ \mathbf{K/V}=\{F_{j}^{p}+\delta_{ij}^{p}\}_{j\in N_{i}}\right)\end{aligned} \tag{5}$$

As shown in Eq. 5, we first employ cross-attention to fuse the information of tracked objects from previous frames into the newly perceived objects of the current frame. Subsequently, self-attention is applied within the current frame so that the detected objects become aware of one another. This process enables the features of newly perceived objects to incrementally assimilate comprehensive information, enriching their contextual awareness. We denote the result of this branch as the proposal context feature $F_{i}^{p^{\prime}}$. Similarly, the tracklet branch undergoes the same propagation in parallel and yields the tracklet context feature $F_{i}^{t^{\prime}}$.
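The query-centric attention of Eq. 5 can be sketched in NumPy as follows. This is a simplified single-head version without learned projections, and the relative embeddings are random placeholders rather than outputs of Eq. 4. Feeding the cross-attention output into the self-attention step is an assumption about how the two lines of Eq. 5 are chained; the function name `query_centric_attention` is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_centric_attention(Fq, Fk, delta):
    """Single-head attention in which every query sees its own set of keys/values.

    Fq: (N_q, D) query features; Fk: (N_k, D) key/value features;
    delta: (N_q, N_k, D) relative positional embedding for each (query, key) pair.
    """
    kv = Fk[None, :, :] + delta                       # (N_q, N_k, D)
    scores = np.einsum("qd,qkd->qk", Fq, kv) / np.sqrt(Fq.shape[-1])
    weights = softmax(scores, axis=-1)                # (N_q, N_k)
    return np.einsum("qk,qkd->qd", weights, kv), weights


rng = np.random.default_rng(0)
N_p, N_t, D = 3, 4, 8
F_p = rng.normal(size=(N_p, D))                       # proposal features
F_t = rng.normal(size=(N_t, D))                       # tracklet features
delta_tp = rng.normal(size=(N_p, N_t, D))             # placeholder relative embeddings
delta_pp = rng.normal(size=(N_p, N_p, D))

# Cross-attention: proposals attend to tracklets of the previous frame.
F_cross, w_cross = query_centric_attention(F_p, F_t, delta_tp)
# Self-attention: proposals attend to each other within the current frame.
F_ctx, w_self = query_centric_attention(F_cross, F_cross, delta_pp)
```

Since the positional term is added to keys and values per query, the queries themselves stay free of absolute coordinates.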

III-C MOT Head

Association with Optimal Transport. The core purpose of the MOT head of StreamMOTP is to associate the $N_{t}$ tracked objects in the previous frame with the $N_{p}$ perceived objects in the current frame. To find the association relationship, we learn an affinity matrix $A^{(\text{tp})}\in\mathbb{R}^{N_{t}\times N_{p}}$ based on the tracklet context features and proposal context features after feature interaction. We use the dot product to compute pairwise similarity, so each entry $A^{(\text{tp})}_{ij}$ represents the similarity score between the tracked object $i$ and the detected object $j$:

$$A^{(\text{tp})}_{ij}=\frac{\langle F^{t^{\prime}}_{i},F^{p^{\prime}}_{j}\rangle}{\sqrt{D}},\quad\forall(i,j)\in N_{t}\times N_{p} \tag{6}$$

where $D$ is the dimension of the context feature.
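Eq. 6 is a scaled dot product between every tracklet and every proposal context feature. A minimal NumPy sketch, with a hypothetical function name `affinity` and toy features chosen for illustration, is:

```python
import numpy as np


def affinity(F_t, F_p):
    """Scaled dot-product affinity between tracklet and proposal context features.

    F_t: (N_t, D), F_p: (N_p, D). Returns A of shape (N_t, N_p).
    """
    D = F_t.shape[-1]
    return F_t @ F_p.T / np.sqrt(D)


# Toy features with D = 4: tracklet 0 resembles proposal 1, tracklet 1 resembles proposal 0.
F_t = np.array([[2.0, 0.0, 0.0, 0.0],
                [0.0, 2.0, 0.0, 0.0]])
F_p = np.array([[0.0, 2.0, 0.0, 0.0],
                [2.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 2.0, 0.0]])
A = affinity(F_t, F_p)
```

The third proposal has no similar tracklet, so its column stays at zero; this is the kind of object that the association step has to treat as newly appeared.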

Given the affinity matrix, we obtain the optimal affinity matrix $A^{(\text{opt})}\in\mathbb{R}^{(N_{t}+1)\times(N_{p}+1)}$ through the log-Sinkhorn algorithm, as in SuperGlue [31], which performs differentiable optimal transport in log-space for stability. Under our streaming framework, because the log-Sinkhorn algorithm is differentiable, the model can adjust the parameters applied to previous frames while optimizing subsequent frames for continuous tracking and prediction. The last row and the last column of $A^{(\text{opt})}$ represent newly appeared objects and tracklets without matched objects, respectively.
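The following NumPy sketch shows one way to run Sinkhorn normalization in log-space with an extra "dustbin" row and column, in the spirit of the SuperGlue formulation. It is an unbatched illustration under assumptions: the dustbin score is a fixed constant here (`bin_score`, a hypothetical parameter), the iteration count is arbitrary, and the function names are not from the StreamMOTP code.

```python
import numpy as np


def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def log_optimal_transport(scores, bin_score=0.0, iters=100):
    """Log-space Sinkhorn with one extra dustbin row and column.

    scores: (N_t, N_p) affinity matrix. Returns log-assignments of shape
    (N_t + 1, N_p + 1).
    """
    n_t, n_p = scores.shape
    Z = np.full((n_t + 1, n_p + 1), bin_score, dtype=float)
    Z[:n_t, :n_p] = scores
    norm = -np.log(n_t + n_p)
    # Each real row/column carries unit mass; each dustbin can absorb the whole other side.
    log_mu = np.concatenate([np.full(n_t, norm), [np.log(n_p) + norm]])
    log_nu = np.concatenate([np.full(n_p, norm), [np.log(n_t) + norm]])
    u, v = np.zeros(n_t + 1), np.zeros(n_p + 1)
    for _ in range(iters):
        u = log_mu - logsumexp(Z + v[None, :], axis=1)   # normalize rows
        v = log_nu - logsumexp(Z + u[:, None], axis=0)   # normalize columns
    return Z + u[:, None] + v[None, :] - norm            # rescale so real rows/columns sum to about 1


scores = np.array([[8.0, 0.0, 0.0],
                   [0.0, 8.0, 0.0]])      # 2 tracklets, 3 proposals
P = np.exp(log_optimal_transport(scores))
```

In this toy case the two tracklets are assigned to the first two proposals, and the third proposal is absorbed by the last row, which corresponds to a newly appeared object.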

Tracking Loss. We supervise the output affinity matrix $A^{(\text{opt})}$ with the ground truth (GT) relationship represented by the matrix $A^{(\text{g})}\in\mathbb{R}^{(N_{t}+1)\times(N_{p}+1)}$. The accuracy of $A^{(\text{opt})}$ is judged by how closely its high-value elements align with the ones in $A^{(\text{g})}$. Therefore, we use the following loss:

$$\mathcal{L}_{\text{tracking}}=-\frac{1}{N_{m}}\cdot(A^{(\text{opt})}e^{-U}+U)\cdot A^{(\text{g})} \tag{7}$$

where the uncertainty matrix $U\in\mathbb{R}^{(N_{t}+1)\times(N_{p}+1)}$ is derived from the tracklet and proposal features to ensure the robustness of training, and $N_{m}$ is the number of matching pairs in $A^{(\text{g})}$. Finally, we get the association relationship $A$ from $A^{(\text{opt})}$.

III-D Dual-Stream Predictor

The predictor predicts all agents’ multi-modal future trajectories. The detail of the predictor is shown in Fig. 3.

Figure 3: Overview of the dual-stream predictor. Two branches predict trajectories for the previous-frame tracklets and the current-frame detections simultaneously. The streaming connection between consecutive frames smooths the predicted trajectories.

Single Frame Prediction. To jointly predict all future trajectories for perceived objects in the current frame, we utilize a transformer-based decoder that incorporates the previously encoded context features through learnable intention queries. To combine the faster convergence provided by the prior of the anchor-based model [13] with the high flexibility of the anchor-free model [12], we combine learnable tokens and anchors to form the query:

$$Q_{p}^{l}=I+\phi(A_{T})+\phi(\hat{x}_{T}^{l-1}) \tag{8}$$

where $Q_{p}^{l}\in\mathbb{R}^{N_{p}\times K\times D}$ is the query input of the layer-$l$ decoder at the current frame. It combines a learnable embedding $I$, the endpoints of the anchors $A_{T}$, and the endpoints predicted by the previous layer $\hat{x}_{T}^{l-1}$, which are fused through $\phi$ (a sinusoidal position encoding followed by an $\operatorname{MLP}$). Next, to aggregate features from the context embedding, we apply attention along the temporal and social dimensions to obtain the multi-modal prediction output.

Dual-Stream Predictor. The predictions for previously tracked objects and for currently perceived objects overlap substantially on the matched objects. As shown in Fig. 4, the $T_{f}+1$ predictions from frame $t-1$ should be consistent with the $T_{f}$ predictions from the current frame $t$ over the last $T_{f}$ frames. Moreover, the streaming nature of StreamMOTP makes it much more feasible to generate temporally consecutive output trajectories.

Based on these observations, we propose a dual-stream predictor to improve the quality and temporal consistency of the predicted trajectories. The predictor comprises two branches: a primary branch that makes predictions for the objects detected in the current frame, and a supportive auxiliary branch for the previously tracked objects. The primary branch follows Single Frame Prediction and predicts from the context features of proposals, while the auxiliary branch leverages the context features of tracklets to generate $K$ adaptive predictions $\hat{Y}_{t} \in \mathbb{R}^{N_{t} \times K \times (T_{f}+1) \times 2}$ specific to the tracked objects. Since the prediction $\hat{Y}_{t}$ from the tracklet frame and $\hat{Y}_{p}$ from the proposal frame share $T_{f}$ overlapping frames, using $\hat{Y}_{t}$ to guide the prediction of $\hat{Y}_{p}$ enhances both the accuracy and the temporal coherence of the predicted trajectory.
Specifically, we encode and map the overlapping $T_{f}$ frames of $\hat{Y}_{t}$ to yield auxiliary features $F_{Y_{t}} \in \mathbb{R}^{N_{p} \times K \times T_{f} \times D}$:

$$F_{Y_{t}} = \operatorname{MLP}(\operatorname{PE}(A^{\top}\hat{Y}_{t})) \tag{9}$$

where $\operatorname{PE}(\cdot)$ denotes sinusoidal position encoding and $A \in \mathbb{R}^{N_{t} \times N_{p}}$ denotes the association matrix given by the MOT head.
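The mapping in Eq. 9 can be sketched as follows, under several assumptions: the overlapping $T_{f}$ steps are taken to be the last $T_{f}$ of the $T_{f}+1$ tracklet predictions, the association matrix is applied with a single `einsum`, both branches are assumed to be expressed in a common coordinate frame, and the encoding and MLP are simple stand-ins. The helper names are hypothetical.

```python
import numpy as np

def sinusoidal_pe(xy, dim):
    """Map 2-D points of shape (..., 2) to sinusoidal features of shape (..., dim)."""
    n_freq = dim // 4
    freqs = 1.0 / (10000.0 ** (np.arange(n_freq) / n_freq))
    angles = xy[..., None] * freqs
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*xy.shape[:-1], dim)

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with ReLU (the depth is an assumption)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def align_tracklet_predictions(A, Y_t):
    """Compute A^T Y_t restricted to the overlapping steps.

    A:   (N_t, N_p) association matrix (rows: tracklets, columns: proposals).
    Y_t: (N_t, K, T_f + 1, 2) predictions from the tracklet branch.
    Returns an array of shape (N_p, K, T_f, 2).
    """
    overlap = Y_t[:, :, 1:, :]                 # drop the first step, keep the last T_f
    return np.einsum('tp,tkfc->pkfc', A, overlap)

def auxiliary_features(A, Y_t, params, dim):
    """Eq. 9: F_{Y_t} = MLP(PE(A^T Y_t)), shape (N_p, K, T_f, dim)."""
    return mlp(sinusoidal_pe(align_tracklet_predictions(A, Y_t), dim), *params)

rng = np.random.default_rng(0)
N_t, N_p, K, T_f, D = 2, 3, 2, 4, 16
Y_t = rng.normal(size=(N_t, K, T_f + 1, 2))
# Hard association: tracklet 0 -> proposal 0, tracklet 1 -> proposal 2, proposal 1 is new.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
params = (rng.normal(size=(D, D)) * 0.1, np.zeros(D),
          rng.normal(size=(D, D)) * 0.1, np.zeros(D))
F_Yt = auxiliary_features(A, Y_t, params, D)
print(F_Yt.shape)  # (3, 2, 4, 16)
```

With a hard one-hot association, each matched proposal simply receives the predictions of its tracklet, and an unmatched proposal receives all-zero coordinates before encoding; how unmatched proposals are actually handled is not specified here.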

Figure 4: The idea of temporal consistency between consecutive frames, where the consistency of the overlap is beneficial for aligning trajectories for continuity and stability.

In addition to Single Frame Prediction, auxiliary features and anchor queries from the current frame are aggregated together in our dual-stream predictor. We adopt multi-head cross attention, taking the anchor embedding from the current frame as query, and the prediction features from the auxiliary tracklet branch as key and value:

$$Q = \operatorname{MHCA}(\mathbf{Q}=Q,\ \mathbf{K}/\mathbf{V}=F_{Y_{t}}) \tag{10}$$

We place Eq. 10 after the interaction between the queries and the proposals' context features but before the self-attention among the queries, so that the queries interact sequentially with historical features, future features, and the social context.
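A minimal NumPy sketch of the cross-attention in Eq. 10 is given below. It assumes that the query of each (object, mode) pair attends only to its own $T_{f}$ auxiliary tokens, and it omits the residual connection, normalization, and dropout that a full decoder layer would normally include; the projection matrices and the name `mhca` are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mhca(Q, F, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head cross-attention.

    Q: (N, K, D) anchor queries of the current frame.
    F: (N, K, T, D) auxiliary features used as keys and values.
    Returns updated queries of shape (N, K, D).
    """
    N, K, D = Q.shape
    T = F.shape[2]
    d = D // n_heads
    q = (Q @ Wq).reshape(N, K, n_heads, d)
    k = (F @ Wk).reshape(N, K, T, n_heads, d)
    v = (F @ Wv).reshape(N, K, T, n_heads, d)
    scores = np.einsum('nkhd,nkthd->nkht', q, k) / np.sqrt(d)   # (N, K, heads, T)
    weights = softmax(scores, axis=-1)                          # attention over the T tokens
    out = np.einsum('nkht,nkthd->nkhd', weights, v).reshape(N, K, D)
    return out @ Wo

rng = np.random.default_rng(0)
N_p, K, T_f, D, H = 3, 6, 4, 32, 4
Q = rng.normal(size=(N_p, K, D))
F_Yt = rng.normal(size=(N_p, K, T_f, D))
Wq, Wk, Wv, Wo = (rng.normal(size=(D, D)) * 0.1 for _ in range(4))
Q_new = mhca(Q, F_Yt, Wq, Wk, Wv, Wo, n_heads=H)
print(Q_new.shape)  # (3, 6, 32)
```

Because the attention weights sum to one over the $T_{f}$ tokens, the output for each query is a convex combination of the projected auxiliary features of its matched tracklet.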

Multi-modal Prediction with Gaussian Mixture Model. As the future behaviors of the agents are highly multi-modal, we follow [12] and represent the distribution of predicted trajectories with a Gaussian Mixture Model (GMM):

$$f\left(\left\{\mathbf{Y}_{i}^{t}\right\}_{t=1}^{T_{f}}\right) = \sum_{k=1}^{K} p_{i,k} \prod_{t=1}^{T_{f}} \operatorname{GMM}\left(\mathbf{Y}_{i}^{t} \mid \boldsymbol{\mu}_{i,k}^{t}, \boldsymbol{\sigma}_{i,k}^{t}\right) \tag{11}$$

where $\left\{p_{i,k}\right\}_{k=1}^{K}$ is the probability distribution over the $K$ modes, and the Gaussian density of the $k$-th mixture component for agent $i$ at time step $t$ is parameterized by $\mu_{i,k}^{t}$ and $\sigma_{i,k}^{t}$. Given Eq. 11 for all predicted steps, we adopt the negative log-likelihood loss and simultaneously supervise the predictions for newly appearing objects in the current frame and the predictions for the tracked objects. The loss is formulated as:

$$\mathcal{L}_{\text{prediction}} = -\log f(\hat{Y}_{p}) - \log f(\hat{Y}_{t}) \tag{12}$$

Then, the final loss of our model is denoted as:

$$\mathcal{L} = \lambda\mathcal{L}_{\text{tracking}} + \mathcal{L}_{\text{prediction}} \tag{13}$$

where $\lambda \in \mathbb{R}_{>0}$ is the weight of the tracking loss, which balances the joint optimization of the two tasks.
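The losses in Eqs. 11 to 13 can be sketched numerically as below. The sketch assumes axis-aligned (diagonal) Gaussians over 2-D positions, mode probabilities obtained from unnormalized scores via a log-softmax, averaging over objects, and a tracking loss that is supplied as a precomputed scalar; the value of `lam` in the example is arbitrary, and the function names are illustrative.

```python
import numpy as np

def logsumexp(a, axis, keepdims=False):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)

def gmm_nll(y, mu, sigma, logits):
    """Negative log of Eq. 11 per object.

    y:      (N, T, 2) ground-truth future positions.
    mu:     (N, K, T, 2) predicted means.
    sigma:  (N, K, T, 2) predicted per-axis standard deviations (> 0).
    logits: (N, K) unnormalized mode scores.
    Returns an array of shape (N,).
    """
    log_p = logits - logsumexp(logits, axis=-1, keepdims=True)   # log p_{i,k}
    z = (y[:, None] - mu) / sigma
    log_n = -0.5 * z ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)
    log_comp = log_n.sum(axis=(-2, -1))          # product over time steps and axes
    return -logsumexp(log_p + log_comp, axis=-1)

def joint_loss(l_tracking, nll_p, nll_t, lam):
    """Eq. 12 and Eq. 13: lam * L_tracking + NLL(proposals) + NLL(tracklets)."""
    return lam * l_tracking + nll_p.mean() + nll_t.mean()

rng = np.random.default_rng(0)
N, K, T = 2, 3, 6
y = rng.normal(size=(N, T, 2))
mu = rng.normal(size=(N, K, T, 2))
sigma = np.exp(rng.normal(size=(N, K, T, 2)) * 0.1)
logits = rng.normal(size=(N, K))
nll = gmm_nll(y, mu, sigma, logits)
total = joint_loss(l_tracking=0.7, nll_p=nll, nll_t=nll, lam=0.5)  # lam chosen arbitrarily
print(nll.shape)  # (2,)
```

Working in log space with `logsumexp` avoids the underflow that the product over $T_{f}$ densities in Eq. 11 would otherwise cause.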

IV EXPERIMENTS

TABLE I: Comparison with existing approaches on nuScenes. All results are based on detections from Megvii.
(a) 3D MOT Performance
Methods AMOTA ↑ MOTA ↑
mmMOT [1] 23.93 19.82
GNN3DMOT [2] 29.84 23.53
AB3DMOT [3] 39.90 31.40
PTP [16] 42.36 32.06
StreamMOTP 46.30 40.50
(b) One Step MOTP Performance
Methods minADE ↓ minFDE ↓
Social-GAN [7] 1.794 2.850
TraPHic [8] 1.827 2.760
Graph-LSTM [32] 1.646 2.445
PTP [16] 1.017 1.527
StreamMOTP 0.810 1.481
(c) Multi Step MOTP Performance
Methods minADE ↓ minFDE ↓
PTP [16] 2.320 3.819
MTP(S=10) [16] 1.585 2.512
MTP(S=200) 1.325 1.979
AffinPred [17] 0.977 1.628
StreamMOTP 0.757 1.487
TABLE II: Ablation study on the components of StreamMOTP.
Memory Bank STPE Stream Predictor AMOTA AMOTP MOTA minADE minFDE MR tc
0.523 0.781 0.426 0.572 0.942 0.113 -
0.556 0.770 0.466 0.384 0.594 0.075 -
0.528 0.782 0.431 0.524 0.838 0.103 -
0.544 0.768 0.456 0.488 0.776 0.098 2.081
0.556 0.779 0.472 0.377 0.586 0.072 1.942

IV-A Experimental Setup and Implementation Details

Dataset and Metrics. The proposed method is evaluated on the popular nuScenes dataset. Following the standard practice [33] of the nuScenes dataset, we predict trajectories for objects perceived in the current frame and use a distance threshold of 2m to match them with GT future trajectories. For trajectory prediction, the models predict future trajectories over horizons of 3s and 6s to align with other works, with a time interval of 0.5s, based on 2s of historical data. For MOT, we employ the commonly used AMOTA, MOTA, and AMOTP for evaluation, and the standard minADE and minFDE metrics are used to evaluate prediction performance. Moreover, we design the metric 'tc' to evaluate temporal consistency, which is calculated as the ADE over the $T_{f}-1$ overlapping frames between the predictions from $T$ to $T+T_{f}$ and the predictions from $T-1$ to $T+T_{f}-1$.
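The prediction metrics above can be illustrated with a short sketch. It assumes that minADE and minFDE are each minimized independently over the $K$ modes, that both trajectories compared by tc are single trajectories (for example the highest-scoring mode of each frame, which is an assumption) expressed in a common coordinate frame, and that a prediction made at frame $T$ covers steps $T+1$ to $T+T_{f}$. The function names are hypothetical.

```python
import numpy as np

def min_ade_fde(pred, gt):
    """pred: (K, T_f, 2) multi-modal prediction, gt: (T_f, 2) ground truth."""
    dist = np.linalg.norm(pred - gt[None], axis=-1)    # (K, T_f) displacement errors
    return dist.mean(axis=1).min(), dist[:, -1].min()

def temporal_consistency(prev_pred, cur_pred):
    """ADE over the T_f - 1 steps shared by two consecutive predictions.

    prev_pred: (T_f, 2), made at frame T-1, covering steps T .. T+T_f-1.
    cur_pred:  (T_f, 2), made at frame T,   covering steps T+1 .. T+T_f.
    """
    return np.linalg.norm(prev_pred[1:] - cur_pred[:-1], axis=-1).mean()

# Ground truth: straight motion along x, covering steps T+1 .. T+6.
gt = np.stack([np.arange(1, 7), np.zeros(6)], axis=-1).astype(float)
pred = np.stack([gt + np.array([0.0, 1.0]),    # mode 0: constant 1 m lateral offset
                 gt + np.array([3.0, 4.0])])   # mode 1: constant 5 m offset
min_ade, min_fde = min_ade_fde(pred, gt)
print(min_ade, min_fde)  # 1.0 1.0

prev_pred = np.stack([np.arange(0, 6), np.zeros(6)], axis=-1).astype(float)
print(temporal_consistency(prev_pred, gt))                         # 0.0
print(temporal_consistency(prev_pred, gt + np.array([0.0, 2.0])))  # 2.0
```

A tc of zero means that the new prediction exactly continues the previous one on the shared steps; larger values indicate that the forecast jumps between consecutive frames.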

Inputs. In StreamMOTP, the input data is organized sequentially. During training, we split the streaming video into training slices and use a sliding window to obtain the inputs at each timestamp in order. To address detector noise, we incorporate the detected results and employ the ground truth (GT) matching relationships up to the $(t-1)$-th frame to create history tracks. Newly perceived objects without association in the current frame serve as proposals. In online inference, the model takes raw detections as input to perform tracking and prediction jointly.

Training. We train our model for 180 epochs. To prevent poor latent memories from impeding training in the early stages, scheduled sampling [34] is applied to the memory bank: features in the memory bank are selected through sampling, and the sampling rate starts to increase at epoch 30, following a sigmoid curve.
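One possible form of such a schedule is sketched below. Only the start epoch of 30 and the sigmoid shape come from the description above; the midpoint, the steepness, the zero rate before the start epoch, and the choice of what replaces a memory feature that is not sampled (`fallback_feats`) are all assumptions made for illustration.

```python
import numpy as np

def memory_sampling_rate(epoch, start=30, midpoint=90.0, steepness=0.1):
    """Probability of using a memory-bank feature at a given epoch.

    Zero before `start`, then a logistic curve; `midpoint` and `steepness`
    are hypothetical values, not taken from the paper.
    """
    if epoch < start:
        return 0.0
    return float(1.0 / (1.0 + np.exp(-steepness * (epoch - midpoint))))

def mix_features(memory_feats, fallback_feats, rate, rng):
    """Per object, keep the memory feature with probability `rate`."""
    use_memory = rng.random(memory_feats.shape[0]) < rate
    mixed = np.where(use_memory[:, None], memory_feats, fallback_feats)
    return mixed, use_memory

rng = np.random.default_rng(0)
memory_feats = rng.normal(size=(5, 8))     # latent features read from the memory bank
fallback_feats = np.zeros((5, 8))          # placeholder for the non-memory alternative
for epoch in (0, 30, 90, 179):
    rate = memory_sampling_rate(epoch)
    mixed, used = mix_features(memory_feats, fallback_feats, rate, rng)
    print(epoch, round(rate, 4), int(used.sum()))
```

Keeping the rate at zero early on means that the model first learns without relying on its own, still unreliable, latent memories, and is then exposed to them gradually.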

IV-B Comparison with Related Work

Table I compares StreamMOTP with other methods in tracking and prediction, using the same Megvii [35] detector for fairness. For MOT, we evaluate all categories, while for trajectory prediction, we adopt two settings from prior studies: (1) Setting1: One Step MOTP. In Setting1, we follow single-step tracking and 3s prediction, similar to PTP [16]. The model uses GT past trajectories $t \in \{T_{c}-T_{h}, \cdots, T_{c}-1\}$ and GT detections in the current frame $T_{c}$, conducts MOT at the current frame, and forecasts future trajectories in frames $t \in \{T_{c}+1, \cdots, T_{c}+T_{f}\}$. Results for all classes from the nuScenes Prediction Challenge are reported. This setting is more suitable for the Vehicle-to-Vehicle (V2V) scenario. (2) Setting2: Multi Step MOTP. In Setting2, we perform standard tracking and 6s prediction for detected objects in $T_{c}$, based on their tracked histories, and evaluate prediction results on all vehicle classes. This setting aligns more closely with the current stage of autonomous driving and is more widely adopted in industry deployments.

Our model surpasses previous related work in both multi-object tracking and trajectory prediction. For MOT, shown in Table I(a), our model not only improves over the PTP baseline [16] by 3.94 points in AMOTA and 8.44 points in MOTA, but also surpasses the other competing trackers. Table I(b) shows the prediction comparison for one-step MOTP. Our model reaches the lowest minADE of 0.810 and minFDE of 1.481, outperforming PTP [16] by 0.207 on minADE and by 0.046 on minFDE. Moreover, Table I(c) compares the predictions for multi-step MOTP, where our model attains state-of-the-art performance with a minADE of 0.757 and a minFDE of 1.487, outperforming AffinPred [17] by 0.220 and 0.141, respectively. The improvements in Table I(c) are more pronounced than those in Table I(b) because trajectory prediction in Setting1 is more saturated than in Setting2, indicating a larger growth potential for prediction based on tracked trajectories.

Figure 5: Qualitative results of StreamMOTP on the nuScenes validation set over consecutive frames. The tracked history and detections are shown in black; the model's highest-scoring prediction and the ground-truth trajectories are drawn in blue and red, respectively. The predictions of the other modes are drawn in gray. The top row shows the results given by the dual-stream predictor, while the bottom row shows the results with a base predictor.
TABLE III: The effect of training slice length (abbreviated as "Slice") and memory bank (abbreviated as "Mem").
Slice Mem AMOTA MOTA minADE minFDE MR
3 ✗ 0.570 0.490 0.633 0.953 0.137
5 ✗ 0.560 0.478 0.402 0.621 0.075
10 ✗ 0.557 0.466 0.384 0.594 0.075
3 ✓ 0.570 0.486 0.537 0.813 0.119
5 ✓ 0.564 0.478 0.392 0.602 0.072
10 ✓ 0.556 0.472 0.377 0.586 0.072
TABLE IV: Ablation study of Memory Bank in Slice=3.
Memory Length AMOTA MOTA minADE minFDE MR
0 0.570 0.490 0.633 0.953 0.137
1 0.569 0.487 0.603 0.921 0.135
2 0.570 0.486 0.537 0.813 0.120

IV-C Ablation Studies

We evaluate the impact of each module within our StreamMOTP framework, as summarized in Table II, where the bottom row represents the full implementation of our method. All models are evaluated under Setting2, except that the detector is switched to CenterPoint [36] and the 3s prediction metrics are computed on True Positive detections at a recall rate of 0.6. The Megvii detector, being an older model, exhibits subpar detection capabilities. Therefore, we switch to a detector with relatively moderate performance to better measure the efficacy of each module.

Effects of each module. Firstly, upon removing the memory bank, we observe a slight decline in performance on both the tracking and prediction tasks; we explore its impact further below. Secondly, we remove the spatio-temporal positional encoding in the spatio-temporal interaction module and encode the absolute coordinate feature in the same way as the attribute feature. Performance drops significantly on both tracking and prediction, which shows that spatio-temporal positional encoding maintains pose-invariance for trajectory prediction and effectively addresses the issue of inconsistent coordinate representations. Thirdly, we replace the dual-stream predictor with a single-frame predictor applied only to the current frame. The second-last row shows that the dual-stream predictor plays a vital role in advancing prediction performance. The modest decrease in tracking further corroborates that augmenting prediction capabilities also benefits tracking results. Notably, the tc metric also worsens (from 1.942 to 2.081) when the dual-stream predictor is removed, which indicates that the dual-stream predictor enhances the quality and consistency of the trajectory predictions. The reason is that, across two consecutive frames, predictions from the previous frame serve as a valuable prior for predicting the trajectories of currently perceived objects, which helps to yield more viable and steady outcomes.

Effects of streaming framework. We explore the effectiveness of the streaming framework and the memory bank by adjusting the length of the training slices. In Table III, tracking performance stays consistent, whereas prediction accuracy benefits significantly from longer training slices because it depends on extensive sequential information. This finding stems from the gap that our models are trained on split slices (multi-frame sequences of length $k$) but evaluated on streaming video (the average length is 40 in nuScenes, $k \ll 40$). This gap constrains the effectiveness of such approaches, especially previous snapshot methods. Our streaming framework narrows the gap between segmented training and continuous video inference by utilizing temporal information over successive frames, thus enhancing prediction performance. Moreover, integrating the memory bank, particularly with shorter slices, markedly boosts prediction accuracy through the retention and utilization of long-term latent features, thereby improving the model's capability for long-term sequence modeling. This is crucial under resource constraints that limit slice length and temporal receptive field. Furthermore, Table IV shows that the model's prediction performance improves as the length of the memory bank increases, which further demonstrates the impact of the memory bank.

TABLE V: Model performance on varying detectors.
Detectors AMOTA AMOTP MOTA minADE minFDE MR
Megvii 0.463 0.997 0.405 0.470 0.751 0.096
CenterPoint 0.556 0.779 0.472 0.377 0.586 0.072

Generalization performance on different detectors. We apply our model to different detectors and summarize the results in Table V. The significant gains with CenterPoint over Megvii in tracking and 3s prediction indicate that our model generalizes across detectors rather than depending on a specific one. We anticipate that the model will achieve even better performance with more advanced detectors.

IV-D Qualitative Results

We provide some qualitative results in Fig. 5 to show our predictions. A brand-new object without a historical trajectory is perceived at frame $t$, and StreamMOTP successfully predicts its future trajectory with social interactions. Moreover, comparing the two rows shows that the predictions of all modes in the top row are smoother and more precise, and that the highest-scoring prediction fluctuates less.

V CONCLUSIONS

In this paper, we introduce StreamMOTP, a streaming and unified framework for joint multi-object tracking and trajectory prediction. With the design of the memory bank, the spatio-temporal positional encoding strategy, and the dual-stream predictor, StreamMOTP bridges the gap between training and actual deployment, and shows strong capability and great potential for both multi-object tracking and trajectory prediction. The experiments on nuScenes demonstrate the effectiveness and superiority of the proposed framework. We hope this work can offer further insights into multi-task end-to-end autonomous driving systems.

References

  • [1] W. Zhang, H. Zhou, S. Sun, Z. Wang, J. Shi, and C. C. Loy, “Robust multi-modality multi-object tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2365–2374.
  • [2] X. Weng, Y. Wang, Y. Man, and K. M. Kitani, “Gnn3dmot: Graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6499–6508.
  • [3] X. Weng, J. Wang, D. Held, and K. Kitani, “Ab3dmot: A baseline for 3d multi-object tracking and new evaluation metrics,” arXiv preprint arXiv:2008.08063, 2020.
  • [4] X. Bai, Z. Hu, X. Zhu, Q. Huang, Y. Chen, H. Fu, and C.-L. Tai, “Transfusion: Robust lidar-camera fusion for 3d object detection with transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1090–1099.
  • [5] L. Wang, X. Zhang, W. Qin, X. Li, J. Gao, L. Yang, Z. Li, J. Li, L. Zhu, H. Wang et al., “Camo-mot: Combined appearance-motion optimization for 3d multi-object tracking with camera-lidar fusion,” IEEE Transactions on Intelligent Transportation Systems, 2023.
  • [6] X. Li, T. Xie, D. Liu, J. Gao, K. Dai, Z. Jiang, L. Zhao, and K. Wang, “Poly-mot: A polyhedral framework for 3d multi-object tracking,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2023, pp. 9391–9398.
  • [7] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social gan: Socially acceptable trajectories with generative adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2255–2264.
  • [8] R. Chandra, U. Bhattacharya, A. Bera, and D. Manocha, “Traphic: Trajectory prediction in dense and heterogeneous traffic using weighted interactions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8483–8492.
  • [9] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan, “Semantic object parsing with graph lstm,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14.   Springer, 2016, pp. 125–143.
  • [10] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid, “Vectornet: Encoding hd maps and agent dynamics from vectorized representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 525–11 533.
  • [11] J. Gu, C. Sun, and H. Zhao, “Densetnt: End-to-end trajectory prediction from dense goal sets,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 303–15 312.
  • [12] B. Varadarajan, A. Hefny, A. Srivastava, K. S. Refaat, N. Nayakanti, A. Cornman, K. Chen, B. Douillard, C. P. Lam, D. Anguelov et al., “Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 7814–7821.
  • [13] S. Shi, L. Jiang, D. Dai, and B. Schiele, “Motion transformer with global intention localization and local movement refinement,” Advances in Neural Information Processing Systems, vol. 35, pp. 6531–6543, 2022.
  • [14] Z. Zhou, J. Wang, Y.-H. Li, and Y.-K. Huang, “Query-centric trajectory prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 863–17 873.
  • [15] X. Weng, B. Ivanovic, and M. Pavone, “Mtp: Multi-hypothesis tracking and prediction for reduced error propagation,” in 2022 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2022, pp. 1218–1225.
  • [16] X. Weng, Y. Yuan, and K. Kitani, “Ptp: Parallelized tracking and prediction with graph neural networks and diversity sampling,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 4640–4647, 2021.
  • [17] X. Weng, B. Ivanovic, K. Kitani, and M. Pavone, “Whose track is it anyway? improving robustness to tracking errors with affinity-based trajectory prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6573–6582.
  • [18] P. Zhang, L. Bai, Y. Wang, J. Fang, J. Xue, N. Zheng, and W. Ouyang, “Towards trajectory forecasting from detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [19] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in 2017 IEEE international conference on image processing (ICIP).   IEEE, 2017, pp. 3645–3649.
  • [20] Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu, “Fairmot: On the fairness of detection and re-identification in multiple object tracking,” International Journal of Computer Vision, vol. 129, pp. 3069–3087, 2021.
  • [21] Z. Wang, L. Zheng, Y. Liu, Y. Li, and S. Wang, “Towards real-time multi-object tracking,” in European conference on computer vision.   Springer, 2020, pp. 107–122.
  • [22] P. Bergmann, T. Meinhardt, and L. Leal-Taixe, “Tracking without bells and whistles,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 941–951.
  • [23] X. Wang, J. He, C. Fu, T. Meng, and M. Huang, “You only need two detectors to achieve multi-modal 3d multi-object tracking,” arXiv preprint arXiv:2304.08709, 2023.
  • [24] H. Wu, W. Han, C. Wen, X. Li, and C. Wang, “3d multi-object tracking in point clouds based on prediction confidence-guided data association,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 5668–5677, 2021.
  • [25] X. Wang, C. Fu, Z. Li, Y. Lai, and J. He, “Deepfusionmot: A 3d multi-object tracking framework based on camera-lidar fusion with deep association,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 8260–8267, 2022.
  • [26] Z. Zhou, L. Ye, J. Wang, K. Wu, and K. Lu, “Hivt: Hierarchical vector transformer for multi-agent motion prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8823–8833.
  • [27] R. Yu and Z. Zhou, “Towards robust human trajectory prediction in raw videos,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2021, pp. 8059–8066.
  • [28] M. Liang, B. Yang, W. Zeng, Y. Chen, R. Hu, S. Casas, and R. Urtasun, “Pnpnet: End-to-end perception and prediction with tracking in the loop,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 553–11 562.
  • [29] S. Wang, Y. Liu, T. Wang, Y. Li, and X. Zhang, “Exploring object-centric temporal modeling for efficient multi-view 3d object detection,” arXiv preprint arXiv:2303.11926, 2023.
  • [30] G. Aydemir, A. K. Akan, and F. Güney, “Adapt: Efficient multi-agent trajectory prediction with adaptation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8295–8305.
  • [31] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4938–4947.
  • [32] R. Chandra, T. Guan, S. Panuganti, T. Mittal, U. Bhattacharya, A. Bera, and D. Manocha, “Forecasting trajectory and behavior of road-agents using spectral clustering in graph-lstms,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4882–4890, 2020.
  • [33] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631.
  • [34] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” Advances in neural information processing systems, vol. 28, 2015.
  • [35] B. Zhu, Z. Jiang, X. Zhou, Z. Li, and G. Yu, “Class-balanced grouping and sampling for point cloud 3d object detection,” arXiv preprint arXiv:1908.09492, 2019.
  • [36] T. Yin, X. Zhou, and P. Krahenbuhl, “Center-based 3d object detection and tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11 784–11 793.