StreamMOTP: Streaming and Unified Framework for Joint 3D Multi-Object Tracking and Trajectory Prediction

Jiaheng Zhuang¹, Guoan Wang², Siyu Zhang², Xiyang Wang², Hangning Zhou², Ziyao Xu², Chi Zhang², Zhiheng Li¹

¹ Tsinghua University, China. ² Mach Drive, China. Email: [email protected]

Abstract

3D multi-object tracking and trajectory prediction are two crucial modules in autonomous driving systems. Traditionally, the two tasks are handled separately, and only a few methods have recently started to model them jointly. However, these approaches suffer from the limitations of single-frame training and inconsistent coordinate representations between the tracking and prediction tasks. In this paper, we propose a streaming and unified framework for joint 3D Multi-Object Tracking and trajectory Prediction (StreamMOTP) to address these challenges. Firstly, we construct the model in a streaming manner and exploit a memory bank to preserve and leverage the long-term latent features of tracked objects more effectively. Secondly, a relative spatio-temporal positional encoding strategy is introduced to bridge the gap in coordinate representations between the two tasks and to maintain pose-invariance for trajectory prediction. Thirdly, we further improve the quality and consistency of predicted trajectories with a dual-stream predictor. We conduct extensive experiments on the popular nuScenes dataset, and the results demonstrate the effectiveness and superiority of StreamMOTP, which outperforms previous methods significantly on both tasks. Furthermore, we show that the proposed framework has great potential and advantages in practical autonomous driving applications.

Figure 1: Different pipelines for the tasks of multi-object tracking and trajectory prediction in autonomous driving. (a) Cascade paradigm, where the two tasks are performed separately with non-differentiable transitions. (b) Joint single-frame paradigm, where the two tasks are performed jointly in a parallelized framework per frame. (c) The proposed StreamMOTP, where the memory, feature, and gradient are propagated across consecutive frames to enhance the long-term modeling ability and temporal consistency.

I INTRODUCTION

In autonomous driving systems, 3D Multi-Object Tracking (MOT) [1, 2, 3, 4, 5, 6] and trajectory prediction [7, 8, 9, 10, 11, 12, 13, 14] are two crucial tasks that play a vital role in ensuring the driving performance of the ego vehicle. High-precision tracking provides a more solid foundation for prediction and, in turn, accurate predictions can enhance the effectiveness of tracking. As depicted in Fig.1 (a), the two tasks are executed one after another in current mainstream autonomous driving pipelines. Although this paradigm has achieved some success, the separated processing flow cannot fully exploit the potential complementarity between tracking and prediction, since it suffers from information loss, feature misalignment, and error accumulation across modules [15]. Although some methods [16, 17, 18] attempt to integrate the two tasks as shown in Fig.1 (b), several limitations remain underexplored: (1) Multi-object tracking and trajectory prediction are both executed in a streaming manner in actual deployments, while the training of most previous methods is conducted in a snap-shot pattern, where the length of the historical window is fixed and long-term information cannot be fully exploited. (2) The coordinate representations of objects for tracking and prediction generally differ: MOT needs a unified coordinate system for optimal association, while most prediction methods adopt an agent-centric coordinate representation for each object to ensure pose-invariance. (3) Most methods focus on predicting the future trajectories of objects visible in the current frame and overlook those lost because of occlusions or missed detections from upstream perception, which may adversely affect downstream tasks.

In this paper, we introduce StreamMOTP, a streaming framework for joint multi-object tracking and trajectory prediction as depicted in Fig.1 (c), where the tasks of MOT and trajectory prediction are jointly performed on successive frames. Specifically, we associate the newly perceived objects with historical tracklets and predict their future trajectories simultaneously. Different from previous works, the extracted latent features of objects are sequentially utilized in StreamMOTP as part of the representation for the subsequent tracked objects during the forward propagation phase. As for the back-propagation, the gradients are not confined to a single frame but are propagated through multiple frames, which greatly narrows the gap between training and online inference, allowing for a more comprehensive learning process by accounting for temporal dependencies across the entire sequence.

Concretely, we extend the training pattern from single-frame to multi-frame and introduce a memory bank to maintain and update long-term latent features for tracked objects, thereby improving the model’s capability for long-term sequence modeling. To address the coordinate system discrepancy between tracking and prediction, we propose a relative Spatio-Temporal Positional Encoding (STPE) strategy, which reconciles and unifies the agent-centric and ego-centric representations used in the two tasks. At the same time, based on the observation that the predicted trajectories of objects in consecutive frames overlap substantially, as depicted in Fig.1 (c-left), we apply a dual-stream predictor to generate future trajectories for both tracked and newly perceived objects simultaneously, which benefits both MOT and trajectory prediction.

It should be pointed out that, with the design of the streaming and unified framework, StreamMOTP obtains the potential and advantages to handle more complex driving scenarios in actual applications. On the one hand, the predicted trajectories for tracked objects could help deal with the problem of occlusions at the current moment by marking the possible positions of obscured targets in the current frame, as shown in Fig.1 (c-middle). On the other hand, for the objects newly perceived in the current frame, StreamMOTP maintains the capability to predict their future trajectories by leveraging social interactions and contextual features stored in the memory bank while traditional prediction methods may fail due to the lack of historical information about them, as shown in Fig.1 (c-right).

The core contributions are summarized as follows:

• We propose StreamMOTP, a joint MOT and trajectory prediction model based on a streaming framework to bridge the gap between training and actual deployment. A memory bank for tracked objects is introduced in this framework to utilize long-term features more effectively.

• We introduce a spatio-temporal positional encoding strategy to construct the relative relationship between objects in different frames, which reconciles and unifies the inconsistent coordinate representations in tracking and prediction.

• We design a dual-stream predictor to simultaneously predict the trajectories of objects in both the current and previous frames. The predicted trajectory from the previous frame can further assist in predicting newly perceived objects’ trajectories, which achieves better temporal consistency in trajectory prediction.

• We achieve better performance for MOT and trajectory prediction on nuScenes, improving AMOTA by 3.94% and reducing minADE / minFDE by 0.220 / 0.141.

II RELATED WORK

II-A 3D Multi-Object Tracking

Existing multi-object tracking paradigms, such as tracking-by-detection (DeepSORT [19], AB3DMOT [3]), joint detection and embedding learning (FairMOT [20], JDE [21]), and joint detection and tracking (Tracktor++ [22], YONDTMOT [23]), typically rely on Kalman filters (KFs) to predict the positions of tracked objects for better association. Yet KFs require fine-tuning of parameters and struggle with occlusions (PC3TMOT [24], DeepFusionMOT [25]). In contrast, dedicated prediction models can provide superior short-term predictions for tracking, especially in complex scenarios such as occlusions. Therefore, combining multi-object tracking and trajectory prediction can effectively improve the overall tracking performance. This combination not only reduces the dependence on traditional methods like KFs but also enhances the robustness and adaptability of tracking methods.

II-B Trajectory Prediction

There has been significant progress in trajectory prediction recently. With pooling [7], graph convolution [9], or attention mechanisms [13, 14], vector-based methods [10] can efficiently aggregate sparse information in traffic scenes. As the future is uncertain, some works (Multipath++ [12], HiVT [26]) predict a multimodal future distribution by decoding a set of trajectories from the scene context, while others (DenseTNT [11]) generate multimodal predictions by leveraging anchors. Though these methods greatly improve trajectory prediction, most of them use GT past trajectories as input for training and testing, neglecting the accumulation of tracking errors under imperfect inputs. Therefore, we handle tracking and prediction jointly, without requiring GT trajectories as the predictor’s inputs, to provide more robust predictions based on practical detectors in the real world.

II-C Joint Tracking and Prediction

In the last couple of years, there has been growing interest in joint tracking and prediction. For example, [27] refines the inputs for the predictor through a re-tracking module, and MTP [15] proposes multi-hypothesis data association to generate multiple sets of tracks for the predictor simultaneously. Besides polishing the input tracklets for the prediction module, some studies combine tracking and prediction through joint optimization. PTP [16] and PnPNet [28] use a shared feature representation to address both tasks. AffinPred [17] and TTFD [18] use affinity matrices rather than tracklets as inputs to the prediction module to improve forecasting performance, but they sacrifice the capability to provide explicit tracking results. However, almost all of these methods operate in a snap-shot form and neglect the misalignment between tracking and prediction. In contrast, our method uses a streaming framework and a unified spatio-temporal positional encoding to address these problems.

III APPROACH

III-A Streaming Framework

Let $\mathcal{D}=\{d_{1},\ldots,d_{N}\}$ represent the set of objects perceived in the current frame by a 3D object detector, where $N$ denotes the number of objects. Concretely, each object at frame $t$ is represented as $d_{i}^{t}=[d_{i}^{\text{pos},t},d_{i}^{\text{size},t},d_{i}^{\text{head},t},d_{i}^{\text{class},t},d_{i}^{\text{score},t}]$, where the elements denote the position, size, heading angle, class, and confidence score from the detection module, respectively. In this paper, the goal of joint 3D multi-object tracking and trajectory prediction has two parts: to associate objects across adjacent frames by assigning a unique track ID to each object, and meanwhile to predict the future trajectories $\mathcal{F}=\{f_{1},\ldots,f_{N}\}$ for all agents in the current frame, with each element of a trajectory specified by a two-dimensional coordinate $(x,y)$.

Based on the observation that the actual physical world is continuous and long-term history is essential for a safer autonomous driving system, we model the task of joint 3D multi-object tracking and trajectory prediction in a streaming manner (shown as Fig. 2). First of all, we extend the pattern of training from single-frame to multi-frame so as to narrow the gap between training and actual deployment. To be more specific, we introduce a Memory Bank for tracked objects to maintain long-term latent features for utilizing the long-term information more effectively, where the latent features are maintained through consecutive frames and could further benefit the performance of both tasks, including not only multi-object tracking but also trajectory prediction.

Specifically, the memory bank consists of $F\times N$ latent features, where $F$ is the length of the memory bank and $N$ is the number of objects stored per frame. At each time step, the latent features of the tracked objects that have been associated with newly perceived objects in the current frame are saved into the memory bank. These features are then used in subsequent frames to enhance the features of tracked objects, as detailed in Sec. III-B. Entries enter and leave the memory bank following the first-in, first-out rule.
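The first-in, first-out behaviour of the memory bank can be illustrated with a minimal sketch. The class name `MemoryBank`, the dictionary keyed by track ID, and the method names are assumptions made for illustration; the paper only specifies an $F\times N$ store of latent features with FIFO replacement.

```python
from collections import deque

import numpy as np


class MemoryBank:
    """Hypothetical FIFO store of per-frame latent features keyed by track ID."""

    def __init__(self, length):
        # A deque with maxlen discards the oldest frame automatically (FIFO).
        self.frames = deque(maxlen=length)

    def push(self, features_by_track):
        # features_by_track: dict mapping track ID -> latent feature of shape (D,)
        self.frames.append(dict(features_by_track))

    def history(self, track_id):
        # Collect the stored features of one track, oldest first.
        feats = [frame[track_id] for frame in self.frames if track_id in frame]
        return np.stack(feats) if feats else None


# Push three frames into a bank of length F = 2; the first frame is evicted.
bank = MemoryBank(length=2)
for t in range(3):
    bank.push({7: np.full(4, float(t))})
hist = bank.history(7)  # shape (2, 4), holding the features of frames 1 and 2
```

In this sketch a track that was not associated in some frame simply has no entry for that frame, so `history` returns fewer than $F$ rows for it.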

III-B Spatio-Temporal Encoder

Feature Extraction. To capture the semantic and motion information of the obstacles in the driving scenario efficiently and adequately, we conduct feature extraction for the tracked and newly perceived objects separately. For the perceived objects from adjacent frames, we use $d\in\mathbb{R}^{N_{p}\times C}$ and $\tau\in\mathbb{R}^{N_{t}\times C}$ to represent the semantic features, where $N_{p}$ and $N_{t}$ denote the number of objects at the current frame $t$ (named proposals) and the previous frame $t-1$ (named tracklets), respectively. At the same time, the historical trajectories over the last $T_{h}$ frames for the $N_{t}$ tracked objects are represented by $H\in\mathbb{R}^{N_{t}\times T_{h}\times C}$. We use a Multi-Layer Perceptron (MLP) to encode the semantic information into high-dimensional features and fuse the historical data $H$ into the tracklets $\tau$ through Multi-Head Cross Attention (MHCA):

$$F_{p}=\operatorname{MLP}(d),\qquad\tilde{F_{t}}=\operatorname{MLP}(\tau)+\operatorname{MHCA}(\operatorname{MLP}(H))\tag{1}$$

where $\tilde{F_{t}}\in\mathbb{R}^{N_{t}\times D}$ , $F_{p}\in\mathbb{R}^{N_{p}\times D}$ , and $C$ , $D$ correspond to the dimension of the semantic and latent high-dimension features respectively.

Additionally, to equip our model with long-term temporal modeling capability, we exploit the latent features saved in the memory bank. Inspired by dynamic weight learning [29] [30], an ego transformation is applied to ensure the temporal alignment and effective feature usage across frames:

$$\begin{aligned}\alpha,\beta&=\operatorname{MLP}(E_{t}-E_{s})\\ M&=\alpha\operatorname{LN}(\tilde{M})+\beta\end{aligned}\tag{2}$$

where Eq.2 is an affine transformation and its parameters are derived from the ego difference between two frames. Then we apply temporal aggregation of long-term latent memory maintained in the memory bank for each tracked object with $\operatorname{MHCA}$ and then fuse the latent memory feature with the extracted feature of tracklets as follows:

$$F_{t}=\tilde{F_{t}}+\operatorname{MHCA}(M)\tag{3}$$

Spatio-Temporal Positional Encoding. For tracking, aligning all features within a unified coordinate system is essential for feature association. In contrast, for prediction, previous research [12][26] has demonstrated the advantages of agent-centric representations, which normalize trajectories to local coordinate systems centered on the selected agent. To bridge the gap in coordinate representations between tracking and prediction, we propose a relative Spatio-Temporal Positional Encoding (STPE) strategy. This approach distinguishes coordinate-independent features from coordinate-dependent ones: the former serve as query tokens for the attention mechanism during feature interaction, while the latter are incorporated into attention through relative positional encoding.

To be specific, we encode the relative spatio-temporal position between object $i$ in previous frame $t$ (tracklet frame) and object $j$ in current frame $p$ (proposal frame) as follows:

$$\delta_{ij}^{tp}=\operatorname{MLP}([p_{j}^{p}-p_{i}^{t},\theta_{j}^{p}-\theta_{i}^{t}])\tag{4}$$
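The input to the MLP in Eq. 4 can be sketched as a pairwise difference of positions and headings. This is only an illustration: the function name `relative_pose_features`, the use of 2D positions, and the wrapping of the heading difference to $[-\pi,\pi)$ are assumptions not stated in the text, and the learned MLP itself is omitted.

```python
import numpy as np


def relative_pose_features(pos_t, head_t, pos_p, head_p):
    """Pairwise [p_j^p - p_i^t, theta_j^p - theta_i^t] for tracklet i and proposal j.

    pos_t: (N_t, 2), head_t: (N_t,), pos_p: (N_p, 2), head_p: (N_p,)
    Returns an array of shape (N_t, N_p, 3).
    """
    d_pos = pos_p[None, :, :] - pos_t[:, None, :]      # (N_t, N_p, 2)
    d_head = head_p[None, :] - head_t[:, None]         # (N_t, N_p)
    # Assumption: wrap heading differences to [-pi, pi) so that angles on
    # either side of the +/-pi discontinuity stay close to each other.
    d_head = (d_head + np.pi) % (2 * np.pi) - np.pi
    return np.concatenate([d_pos, d_head[..., None]], axis=-1)


pos_t = np.array([[0.0, 0.0], [5.0, 1.0]])
head_t = np.array([3.0, 0.0])
pos_p = np.array([[1.0, 0.5]])
head_p = np.array([-3.0])
rel = relative_pose_features(pos_t, head_t, pos_p, head_p)  # shape (2, 1, 3)

# Shifting every object by the same offset leaves the relative features unchanged.
shift = np.array([100.0, -40.0])
rel_shifted = relative_pose_features(pos_t + shift, head_t, pos_p + shift, head_p)
```

Because only differences enter the encoding, the features do not depend on where the origin of the shared coordinate system is placed, which is the property the relative encoding is meant to provide.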

Attentional Spatio-Temporal Interaction. Based on the relative embedding from the spatio-temporal positional encoding strategy, we fuse the features of proposals and tracklets with cross-attention and self-attention iteratively. Taking the proposal branch as an example, we use query-centric attention with the spatio-temporal positional encoding strategy, incorporating the relative positional embedding into the key/value of the attention mechanism:

$$\begin{aligned}F_{i}^{p\prime}&=\operatorname{MHCA}\left(\mathbf{Q}=F_{i}^{p},\ \mathbf{K/V}=\{F_{j}^{t}+\delta_{ij}^{tp}\}_{j\in N_{i}}\right)\\ F_{i}^{p\prime}&=\operatorname{MHSA}\left(\mathbf{Q}=F_{i}^{p},\ \mathbf{K/V}=\{F_{j}^{p}+\delta_{ij}^{p}\}_{j\in N_{i}}\right)\end{aligned}\tag{5}$$

As shown in Eq. 5, we first employ cross-attention to fuse the information of tracked objects from previous frames into the newly perceived objects in the current frame. Subsequently, self-attention is applied within the current frame to foster awareness among the detected objects in this frame. This process enables the features of newly perceived objects to incrementally assimilate comprehensive information, enriching their contextual awareness. We denote the result of this branch as the proposals context feature $F_{i}^{p^{\prime}}$. Similarly, the tracklet branch undergoes the same propagation in parallel and yields the tracklets context feature $F_{i}^{t^{\prime}}$.

III-C MOT Head

Association with Optimal Transport. The core purpose of the MOT head in StreamMOTP is to associate the $N_{t}$ tracked objects in the previous frame with the $N_{p}$ perceived objects in the current frame. To find the association, we learn an affinity matrix $A^{\left(\text{tp}\right)}\in\mathbb{R}^{N_{t}\times N_{p}}$ based on the tracklets context feature and the proposals context feature after feature interaction. We use the scaled dot product to compute pairwise similarity, so each entry $A^{\left(\text{tp}\right)}_{ij}$ represents the similarity score between tracked object $i$ and detected object $j$:

$$A^{\left(\text{tp}\right)}_{ij}=\frac{\langle F^{t^{\prime}}_{i},F^{p^{\prime}}_{j}\rangle}{\sqrt{D}},\quad\forall(i,j)\in N_{t}\times N_{p}\tag{6}$$

where $D$ is the dimension of the context feature.

Given the affinity matrix, we obtain the optimal affinity matrix $A^{(\text{opt})}\in\mathbb{R}^{(N_{t}+1)\times(N_{p}+1)}$ through the log-domain Sinkhorn algorithm, as in SuperGlue [31], which performs differentiable optimal transport in log-space for stability. Under our streaming framework, the differentiability of the log-domain Sinkhorn algorithm allows the optimization of subsequent frames to update the parameters used in previous frames, supporting continuous tracking and prediction. The last row and the last column of $A^{(\text{opt})}$ represent newly appeared objects and tracklets without matched objects, respectively.
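The association step can be sketched in NumPy as a scaled dot-product affinity (Eq. 6) followed by log-domain Sinkhorn iterations on a matrix augmented with one extra row and column. This is a simplified, non-learned illustration in the spirit of SuperGlue's formulation: the function names, the fixed `dustbin` score (a learnable parameter in SuperGlue), and the iteration count are assumptions.

```python
import numpy as np


def affinity(f_t, f_p):
    # Eq. 6: scaled dot product between tracklet (N_t, D) and proposal (N_p, D) features.
    return f_t @ f_p.T / np.sqrt(f_t.shape[1])


def _logsumexp(x, axis):
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def log_sinkhorn(scores, dustbin=0.0, iters=100):
    """Return the log of the (N_t + 1) x (N_p + 1) transport plan."""
    n_t, n_p = scores.shape
    z = np.full((n_t + 1, n_p + 1), dustbin, dtype=np.float64)
    z[:n_t, :n_p] = scores
    # Each real row/column carries unit mass; the extra row/column can absorb
    # all unmatched proposals/tracklets.
    norm = -np.log(n_t + n_p)
    log_mu = np.concatenate([np.full(n_t, norm), [np.log(n_p) + norm]])
    log_nu = np.concatenate([np.full(n_p, norm), [np.log(n_t) + norm]])
    u = np.zeros(n_t + 1)
    v = np.zeros(n_p + 1)
    for _ in range(iters):
        u = log_mu - _logsumexp(z + v[None, :], axis=1)
        v = log_nu - _logsumexp(z + u[:, None], axis=0)
    return z + u[:, None] + v[None, :] - norm


# Two tracklets, three proposals: proposal 2 has no strong match.
scores = np.array([[2.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
plan = np.exp(log_sinkhorn(scores))     # shape (3, 4)
matches = plan[:2, :3].argmax(axis=1)   # best proposal for each tracklet
```

In this toy case each tracklet is matched to the proposal with the high score, while most of the mass of the third proposal falls in the last row, which corresponds to a newly appeared object.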

Tracking Loss. We supervise the output affinity matrix $A^{\text{(opt)}}$ with the ground truth (GT) relationship represented by the matrix $A^{\text{(g)}}\in\mathbb{R}^{(N_{t}+1)\times(N_{p}+1)}$ . The accuracy of $A^{\text{(opt)}}$ is judged by how closely its high-value elements align with the ones in $A^{\text{(g)}}$ . Therefore, we use the following loss:

$$\mathcal{L}_{\text{tracking}}=-\frac{1}{N_{m}}\cdot\left(A^{(\text{opt})}e^{-U}+U\right)\cdot A^{(\text{g})}\tag{7}$$

where the uncertainty matrix $U\in\mathbb{R}^{(N_{t}+1)\times(N_{p}+1)}$ is derived from the tracklet and proposal features to ensure the robustness of training, and $N_{m}$ is the number of matching pairs in $A^{(\text{g})}$. Finally, we obtain the association relationship $A$ from $A^{\text{(opt)}}$.

III-D Dual-Stream Predictor

The predictor predicts all agents’ multi-modal future trajectories. The detail of the predictor is shown in Fig. 3.

Single Frame Prediction. To jointly predict all future trajectories for perceived objects in the current frame, we use a transformer-based decoder that incorporates the previously encoded context features through learnable intention queries. To combine the faster convergence provided by the prior of the anchor-based model [13] with the high flexibility of the anchor-free model [12], we combine learnable tokens and anchors to form the query:

$$Q_{p}^{l}=I+\phi(A_{T})+\phi(\hat{x}_{T}^{l-1})\tag{8}$$

where $Q_{p}^{l}\in\mathbb{R}^{N_{p}\times K\times D}$ is the query input at the current frame and layer $l$ decoder, which is combined from a learnable embedding $I$ , the endpoints of the anchors $A_{T}$ , and the predicted endpoints of previous layer $\hat{x}_{T}^{l-1}$ , which are fused through $\phi$ (a sinusoidal position encoding followed by an $\operatorname{MLP}$ ). Next, to aggregate features from context embedding, we perform attention mechanism on the temporal and social dimensions to get multi-modal prediction output.

Dual-Stream Predictor. The predictions for previously tracked objects and currently perceived objects overlap substantially for matched objects. As shown in Fig.4, the $T_{f}+1$ predictions from frame $t-1$ should be consistent with the $T_{f}$ predictions from the current frame $t$ over the last $T_{f}$ frames. The streaming nature of StreamMOTP makes it much more feasible to generate such consecutive output trajectories.

Based on these observations, we propose a dual-stream predictor to improve the quality and temporal consistency of the predicted trajectories. The predictor comprises two branches: a primary branch that makes predictions for the detected objects in the current frame, and a supportive auxiliary branch for the previously tracked objects. The primary branch follows Single Frame Prediction to predict from the context features of proposals, while the auxiliary branch leverages the context features of tracklets to generate $K$ adaptive predictions $\hat{Y_{t}}\in\mathbb{R}^{N_{t}\times K\times(T_{f}+1)\times 2}$ specific to the tracked objects. Since the prediction $\hat{Y_{t}}$ from the tracklet frame and $\hat{Y_{p}}$ from the proposal frame have $T_{f}$ overlapping frames, using $\hat{Y_{t}}$ to guide the prediction of $\hat{Y_{p}}$ enhances both the accuracy and the temporal coherence of the predicted trajectory. Specifically, we encode and map the overlapping $T_{f}$ frames of $\hat{Y_{t}}$ to yield auxiliary features $F_{Y_{t}}\in\mathbb{R}^{N_{p}\times K\times T_{f}\times D}$:

$$F_{Y_{t}}=\operatorname{MLP}(\operatorname{PE}(A^{T}\hat{Y_{t}}))\tag{9}$$

where $\operatorname{PE}(\cdot)$ denotes sinusoidal position encoding, $A\in\mathbb{R}^{N_{t}\times N_{p}}$ denotes the association matrix given by MOT head.
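The term $A^{T}\hat{Y_{t}}$ in Eq. 9 reorders the tracklet-branch predictions so that each proposal receives the prediction of its matched tracklet. A minimal sketch is given below, assuming a binary association matrix, assuming that the first predicted step of $\hat{Y_{t}}$ is the non-overlapping one, and omitting the positional encoding and MLP; the function name is hypothetical.

```python
import numpy as np


def gather_tracklet_predictions(assoc, y_t):
    """Map tracklet predictions onto proposals via the association matrix.

    assoc: (N_t, N_p) binary association matrix A from the MOT head.
    y_t:   (N_t, K, T_f + 1, 2) predictions made at the tracklet frame.
    Returns (N_p, K, T_f, 2); unmatched proposals receive zeros.
    """
    overlap = y_t[:, :, 1:, :]  # keep the last T_f steps shared with the current frame
    return np.einsum("tp,tkfc->pkfc", assoc, overlap)


rng = np.random.default_rng(0)
y_t = rng.normal(size=(2, 6, 7, 2))       # N_t = 2, K = 6, T_f + 1 = 7
assoc = np.array([[0.0, 1.0, 0.0],        # tracklet 0 <-> proposal 1
                  [1.0, 0.0, 0.0]])       # tracklet 1 <-> proposal 0
aux = gather_tracklet_predictions(assoc, y_t)  # shape (3, 6, 6, 2)
```

Proposal 2 has no matched tracklet here, so its auxiliary input is all zeros and it must rely on the primary branch alone.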

In addition to Single Frame Prediction, auxiliary features and anchor queries from the current frame are aggregated together in our dual-stream predictor. We adopt multi-head cross attention, taking the anchor embedding from the current frame as query, and the prediction features from the auxiliary tracklet branch as key and value:

$$Q=\operatorname{MHCA}(\mathbf{Q}=Q,\ \mathbf{K/V}=F_{Y_{t}})\tag{10}$$

We place Eq. 10 after the interaction between queries and proposals context features, while before the self-attention of the queries, making the queries interact sequentially with historical features, future features, and the social context.

Multi-modal Prediction with Gaussian Mixture Model. As the future behaviors of the agents are highly multi-modal, we follow [12] and represent the distribution of predicted trajectories with a Gaussian Mixture Model (GMM):

$$f\left(\left\{\mathbf{Y}_{i}^{t}\right\}_{t=1}^{T_{f}}\right)=\sum_{k=1}^{K}p_{i,k}\prod_{t=1}^{T_{f}}\text{GMM}\left(\mathbf{Y}_{i}^{t}\mid\boldsymbol{\mu}_{i,k}^{t},\mathbf{\sigma}_{i,k}^{t}\right)\tag{11}$$

where $\left\{p_{i,k}\right\}_{k=1}^{K}$ is the probability distribution over the $K$ modes, and the $k$-th mixture component’s Gaussian density for agent $i$ at time step $t$ is parameterized by $\mu_{i,k}^{t}$ and $\sigma_{i,k}^{t}$. Given Eq. 11 for all predicted steps, we adopt the negative log-likelihood loss and supervise the predictions for newly perceived objects in the current frame and the predictions for tracked objects simultaneously. The loss is formulated as:

$$\mathcal{L}_{\text{prediction}}=-\log f(\hat{Y_{p}})-\log f(\hat{Y_{t}})\tag{12}$$
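For a single agent, the negative log-likelihood of Eq. 11 can be evaluated stably with a log-sum-exp over the modes. The sketch below assumes independent Gaussians for the $x$ and $y$ coordinates with per-step standard deviations; the function name and this diagonal parameterization are illustrative assumptions, since the text does not specify the covariance structure.

```python
import numpy as np


def gmm_trajectory_nll(y, mu, sigma, log_p):
    """-log f(Y) for one agent under a K-mode mixture over whole trajectories.

    y: (T_f, 2) ground-truth future; mu, sigma: (K, T_f, 2); log_p: (K,) log mode weights.
    """
    # Per-coordinate Gaussian log-density (assumed independent x and y).
    log_n = (-0.5 * ((y[None] - mu) / sigma) ** 2
             - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    # The product over time steps becomes a sum of log-densities per mode.
    log_mode = log_p + log_n.sum(axis=(1, 2))            # (K,)
    m = log_mode.max()
    return -(m + np.log(np.exp(log_mode - m).sum()))     # stable log-sum-exp


y = np.zeros((3, 2))
mu = np.stack([np.zeros((3, 2)), np.ones((3, 2))])       # mode 0 matches the ground truth
sigma = np.ones((2, 3, 2))
log_p = np.log(np.array([0.5, 0.5]))
nll = gmm_trajectory_nll(y, mu, sigma, log_p)
```

Under Eq. 12, this quantity would be summed for the proposal-branch and tracklet-branch predictions of every supervised agent.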

Then, the final loss of our model is denoted as:

$$\mathcal{L}=\lambda\mathcal{L}_{\text{tracking}}+\mathcal{L}_{\text{prediction}}\tag{13}$$

where $\lambda\in\mathbb{R}_{>0}$ is the weight of the tracking loss, which balances the joint optimization of the two tasks.

IV EXPERIMENTS

TABLE I: Comparison with existing approaches (on nuScenes). All results are based on detections from Megvii.

(a) 3D MOT Performance

Methods	AMOTA $\uparrow$	MOTA $\uparrow$
mmMOT [1]	23.93	19.82
GNN3DMOT [2]	29.84	23.53
AB3DMOT [3]	39.90	31.40
PTP [16]	42.36	32.06
StreamMOTP	46.30	40.50

(b) One Step MOTP Performance

Methods	minADE $\downarrow$	minFDE $\downarrow$
Social-GAN [7]	1.794	2.850
TraPHic [8]	1.827	2.760
Graph-LSTM [32]	1.646	2.445
PTP [16]	1.017	1.527
StreamMOTP	0.810	1.481

(c) Multi Step MOTP Performance

Methods	minADE $\downarrow$	minFDE $\downarrow$
PTP [16]	2.320	3.819
MTP(S=10) [15]	1.585	2.512
MTP(S=200)	1.325	1.979
AffinPred [17]	0.977	1.628
StreamMOTP	0.757	1.487

TABLE II: Ablation study on the components of StreamMOTP.

Memory Bank	STPE	Stream Predictor	AMOTA	AMOTP	MOTA	minADE	minFDE	MR	tc
			0.523	0.781	0.426	0.572	0.942	0.113	-
	✓	✓	0.556	0.770	0.466	0.384	0.594	0.075	-
✓		✓	0.528	0.782	0.431	0.524	0.838	0.103	-
✓	✓		0.544	0.768	0.456	0.488	0.776	0.098	2.081
✓	✓	✓	0.556	0.779	0.472	0.377	0.586	0.072	1.942

IV-A Experimental Setup and Implementation Details

Dataset and Metrics. The proposed method is evaluated on the popular nuScenes dataset. Following the standard practices [33] of the nuScenes dataset, we predict trajectories for objects perceived in the current frame and use a distance threshold of 2m to match them with GT future trajectories. For trajectory prediction, the models predict future trajectories for 3s and 6s to align with other works, with a time interval of 0.5s, based on 2s of historical data. For MOT, we employ the commonly used AMOTA, MOTA, and AMOTP for evaluation. The standard minADE and minFDE metrics are used to evaluate prediction performance. Moreover, we design the metric ‘tc’ to evaluate temporal consistency, which is calculated as the ADE over the $T_{f}-1$ overlapping frames between predictions from $T$ to $T+T_{f}$ and predictions from $T-1$ to $T+T_{f}-1$.
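The ‘tc’ metric can be sketched as the average displacement between two consecutive predictions on the frames they share. The sketch assumes that both predictions are expressed in the same coordinate frame, that each contains $T_{f}$ steps, and that a single trajectory per object is compared; how the metric is applied to multi-modal outputs is not specified in the text, and the function name is hypothetical.

```python
import numpy as np


def temporal_consistency(prev_pred, curr_pred):
    """ADE over the T_f - 1 frames shared by two consecutive predictions.

    prev_pred: (N, T_f, 2) predictions made at frame T - 1.
    curr_pred: (N, T_f, 2) predictions made at frame T for the same N objects.
    """
    overlap_prev = prev_pred[:, 1:, :]    # drop the step only covered by the older prediction
    overlap_curr = curr_pred[:, :-1, :]   # drop the step only covered by the newer prediction
    return float(np.linalg.norm(overlap_prev - overlap_curr, axis=-1).mean())


# One object moving along x at 1 m per step, predicted consistently in both frames.
line = np.stack([np.arange(8.0), np.zeros(8)], axis=-1)   # (8, 2)
prev_pred = line[None, 0:6]                               # steps predicted at T - 1
curr_pred = line[None, 1:7]                               # steps predicted at T
tc_consistent = temporal_consistency(prev_pred, curr_pred)
tc_offset = temporal_consistency(prev_pred, curr_pred + np.array([0.0, 1.0]))
```

Lower values indicate that successive predictions agree more closely on the shared horizon.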

Inputs. In StreamMOTP, input data is formatted in a sequential format. During training, we split the streaming video into training slices and use a sliding window to sequentially get the inputs at each timestamp. To address detector noise, we incorporate the detected results and employ the ground truth (GT) matching relationships up to the $(t-1)$ -th frame to create history tracks. Newly perceived objects without association in the current frame serve as proposals. In online inference, the model takes raw detections as input to perform tracking and prediction jointly.

Training. We train our model for 180 epochs. To avoid poor latent memories that may impede training in the early stages, scheduled sampling [34] is applied to the memory bank: features in the memory bank are selected through sampling, and the sampling rate starts to increase at epoch 30, following a sigmoid curve.
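One possible shape for such a schedule is sketched below. Only the start epoch (30), the total number of epochs (180), and the sigmoid form come from the text; the function name, the steepness, the midpoint, and the choice of a zero rate before epoch 30 are assumptions for illustration.

```python
import math


def memory_sampling_rate(epoch, start=30, total=180, steepness=10.0):
    """Hypothetical probability of using memory-bank features at a given epoch."""
    if epoch < start:
        return 0.0  # assumption: the memory bank is not sampled before the start epoch
    progress = (epoch - start) / (total - start)   # 0 at the start epoch, 1 at the last epoch
    return 1.0 / (1.0 + math.exp(-steepness * (progress - 0.5)))


rates = [memory_sampling_rate(e) for e in range(0, 181, 30)]
```

With these illustrative settings the rate stays at zero early on, rises most quickly around the middle of the remaining epochs, and approaches one by the end of training.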

IV-B Comparison with Related Work

Table I compares StreamMOTP with other methods in tracking and prediction, using the same Megvii [35] detector for fairness. For MOT, we evaluate all categories, while for trajectory prediction, we adopt two settings from prior studies: (1) Setting1: One Step MOTP. In Setting1, we follow single-step tracking and 3s prediction, similar to PTP [16]. The model uses GT past trajectories $t\in\{T_{c}-T_{h},\cdots,T_{c}-1\}$ and GT detections in the current frame $T_{c}$, conducts MOT at the current frame, and forecasts future trajectories in frames $t\in\{T_{c}+1,\cdots,T_{c}+T_{f}\}$. Results for all classes from the nuScenes Prediction Challenge are reported. This setting is more suitable for Vehicle-to-Vehicle (V2V) scenarios. (2) Setting2: Multi Step MOTP. In Setting2, we perform standard tracking and 6s prediction for detected objects in $T_{c}$, based on their tracked histories, and evaluate prediction results on all vehicle classes. This setting aligns more closely with the current stage of autonomous driving and is more widely adopted in industry deployments.

Our model surpasses previous related work in both multi-object tracking and trajectory prediction. For MOT, shown in Table I(a), our model not only improves over the PTP baseline [16] by 3.94% in AMOTA and 8.44% in MOTA, but also surpasses several competing trackers. Table I(b) shows the prediction comparison for one-step MOTP. Our model reaches the lowest minADE of 0.810 and minFDE of 1.481, outperforming PTP [16] by 0.207 on minADE and 0.046 on minFDE. Moreover, Table I(c) compares multi-step MOTP predictions, where our model attains state-of-the-art performance with a minADE of 0.757 and a minFDE of 1.487, outperforming AffinPred [17] by 0.220 and 0.141, respectively. The improvements in Table I(c) are more pronounced than those in Table I(b) because trajectory prediction in Setting1 is more saturated than in Setting2, indicating larger growth potential for prediction based on tracked trajectories.

TABLE III: The effect of training slice length (abbreviated as "Slice") and memory bank (abbreviated as "Mem").

Slice	Mem	AMOTA	MOTA	minADE	minFDE	MR
3		0.570	0.490	0.633	0.953	0.137
5		0.560	0.478	0.402	0.621	0.075
10		0.557	0.466	0.384	0.594	0.075
3	✓	0.570	0.486	0.537	0.813	0.119
5	✓	0.564	0.478	0.392	0.602	0.072
10	✓	0.556	0.472	0.377	0.586	0.072

TABLE IV: Ablation study of Memory Bank in Slice=3.

Memory Length	AMOTA	MOTA	minADE	minFDE	MR
0	0.570	0.490	0.633	0.953	0.137
1	0.569	0.487	0.603	0.921	0.135
2	0.570	0.486	0.537	0.813	0.120

IV-C Ablation Studies

We evaluate the impact of each module within our StreamMOTP framework, as summarized in Table II, where the bottom row represents the full implementation of our method. All models are evaluated under Setting2, except that the detector is switched to CenterPoint [36] and 3s prediction metrics are computed on True Positive detections at a recall rate of 0.6. The Megvii detector, being an older model, exhibits subpar detection capabilities. Therefore, we switch to a detector with relatively moderate performance to better measure each module’s efficacy.

Effects of each module. Firstly, upon removing the memory bank, we observe a slight decline in performance for both tracking and prediction. We explore its impact further below. Secondly, we remove the spatio-temporal positional encoding in the spatio-temporal interaction module and encode the absolute coordinate feature in the same way as the attribute feature. There is a significant drop in performance for both tracking and prediction, which shows that spatio-temporal positional encoding maintains pose-invariance for trajectory prediction and effectively addresses the issue of inconsistent coordinate representations. Thirdly, we replace the dual-stream predictor with a single frame predictor applied only to the current frame. The second-to-last row shows that the dual-stream predictor plays a vital role in advancing prediction performance. The modest decrease in tracking further corroborates that augmenting prediction capabilities also benefits tracking results. Notably, the tc metric also degrades (from 1.942 to 2.081) when the dual-stream predictor is removed, which indicates that the dual-stream predictor enhances the quality and consistency of the trajectory predictions. The reason is that, across two consecutive frames, predictions from the previous frame serve as a valuable prior for predicting the trajectories of currently perceived objects, which helps to yield more viable and steady outcomes.

Effects of streaming framework. We explore the effectiveness of the streaming framework and the memory bank by adjusting the length of the training slices. In Table III, tracking performance stays consistent, whereas prediction accuracy benefits significantly from longer training slices because prediction depends on extensive sequential information. This finding stems from a gap between training and evaluation: our models are trained on split slices (multi-frame sequences of length $k$) but evaluated on streaming video (the average length is 40 in nuScenes, $k\ll 40$). This gap constrains the effectiveness of existing approaches, especially previous snapshot methods. Our streaming framework narrows the gap between segmented training and continuous video inference by utilizing temporal information over successive frames, thus enhancing prediction performance. Moreover, the memory bank markedly boosts prediction accuracy, particularly with shorter slices, by retaining and utilizing long-term latent features, which improves the model’s capability for long-term sequence modeling. This is crucial under resource constraints that limit the slice length and the temporal receptive field. Furthermore, Table IV shows that the model’s performance improves as the length of the memory bank increases, which further demonstrates the impact of the memory bank.
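The gap between slice-based training and streaming inference can be made concrete with a toy sketch. Everything below is an assumption for illustration: `split_into_slices` and `MemoryBank` are hypothetical names, the slice length of 8 and the bank length of 3 are arbitrary, and the stored strings stand in for the latent features of tracked objects. It is not the memory bank design of StreamMOTP.

```python
from collections import deque


def split_into_slices(num_frames, k):
    """Split frame indices 0..num_frames-1 into consecutive slices of length k."""
    return [
        list(range(start, min(start + k, num_frames)))
        for start in range(0, num_frames, k)
    ]


class MemoryBank:
    """Fixed-length FIFO store of per-frame features (hypothetical)."""

    def __init__(self, max_len):
        # A deque with maxlen silently drops the oldest entry when full.
        self.frames = deque(maxlen=max_len)

    def push(self, feature):
        self.frames.append(feature)

    def read(self):
        return list(self.frames)


# Training view: a 40-frame scene cut into slices of length k = 8.
slices = split_into_slices(num_frames=40, k=8)

# Inference view: frames arrive one by one and the bank keeps recent history.
bank = MemoryBank(max_len=3)
for frame_idx in range(40):
    bank.push(f"feat_{frame_idx}")

print(len(slices), bank.read())
```

In this sketch the bank length bounds how much history is available at each step, independently of how the scene was sliced for training.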

TABLE V: Model performance on varying detectors.

Detectors	AMOTA	AMOTP	MOTA	minADE	minFDE	MR
Megvii	0.463	0.997	0.405	0.470	0.751	0.096
CenterPoint	0.556	0.779	0.472	0.377	0.586	0.072

Generalization performance on different detectors. We apply our model to different detectors and summarize the results in Table V. Switching from Megvii to CenterPoint yields substantial gains in both tracking and 3s prediction, which indicates that our model generalizes across detectors rather than depending on a specific one. We expect the model to achieve even better performance with more advanced detectors.
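To quantify the difference between the two rows of Table V, the relative change of each metric can be computed directly from the reported values. The short sketch below does only that arithmetic; the dictionary layout and the `relative_change` helper are introduced here for illustration. Note that AMOTA and MOTA are higher-is-better, while AMOTP, minADE, minFDE, and MR are lower-is-better, so a negative change on the latter group is an improvement.

```python
# Values copied from Table V.
table_v = {
    "Megvii": {
        "AMOTA": 0.463, "AMOTP": 0.997, "MOTA": 0.405,
        "minADE": 0.470, "minFDE": 0.751, "MR": 0.096,
    },
    "CenterPoint": {
        "AMOTA": 0.556, "AMOTP": 0.779, "MOTA": 0.472,
        "minADE": 0.377, "minFDE": 0.586, "MR": 0.072,
    },
}


def relative_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


changes = {
    metric: round(
        relative_change(table_v["Megvii"][metric], table_v["CenterPoint"][metric]),
        1,
    )
    for metric in table_v["Megvii"]
}

for metric, pct in changes.items():
    print(f"{metric}: {pct:+.1f}%")
```

Under this calculation, AMOTA rises by about 20% and minADE falls by about 20% when Megvii is replaced with CenterPoint.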

IV-D Qualitative Results

We provide some qualitative results in Fig. 5 to illustrate our predictions. A brand-new object without a historical trajectory is perceived at frame $t$. StreamMOTP successfully predicts its future trajectory by exploiting social interactions. Moreover, comparing the two rows shows that the predictions of all modes in the top row are smoother and more precise, and that the highest prediction score fluctuates less.

V CONCLUSIONS

In this paper, we introduce StreamMOTP, a streaming and unified framework for joint multi-object tracking and trajectory prediction. With the design of the memory bank, the spatio-temporal positional encoding strategy, and the dual-stream predictor, StreamMOTP bridges the gap between training and actual deployment and shows strong capability and great potential for both multi-object tracking and trajectory prediction. The experiments on nuScenes demonstrate the effectiveness and superiority of the proposed framework. We hope this work can offer further insights into multi-task end-to-end autonomous driving systems.

References

[1] W. Zhang, H. Zhou, S. Sun, Z. Wang, J. Shi, and C. C. Loy, “Robust multi-modality multi-object tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2365–2374.
[2] X. Weng, Y. Wang, Y. Man, and K. M. Kitani, “Gnn3dmot: Graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6499–6508.
[3] X. Weng, J. Wang, D. Held, and K. Kitani, “Ab3dmot: A baseline for 3d multi-object tracking and new evaluation metrics,” arXiv preprint arXiv:2008.08063, 2020.
[4] X. Bai, Z. Hu, X. Zhu, Q. Huang, Y. Chen, H. Fu, and C.-L. Tai, “Transfusion: Robust lidar-camera fusion for 3d object detection with transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1090–1099.
[5] L. Wang, X. Zhang, W. Qin, X. Li, J. Gao, L. Yang, Z. Li, J. Li, L. Zhu, H. Wang et al., “Camo-mot: Combined appearance-motion optimization for 3d multi-object tracking with camera-lidar fusion,” IEEE Transactions on Intelligent Transportation Systems, 2023.
[6] X. Li, T. Xie, D. Liu, J. Gao, K. Dai, Z. Jiang, L. Zhao, and K. Wang, “Poly-mot: A polyhedral framework for 3d multi-object tracking,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023, pp. 9391–9398.
[7] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social gan: Socially acceptable trajectories with generative adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2255–2264.
[8] R. Chandra, U. Bhattacharya, A. Bera, and D. Manocha, “Traphic: Trajectory prediction in dense and heterogeneous traffic using weighted interactions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8483–8492.
[9] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan, “Semantic object parsing with graph lstm,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer, 2016, pp. 125–143.
[10] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid, “Vectornet: Encoding hd maps and agent dynamics from vectorized representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11525–11533.
[11] J. Gu, C. Sun, and H. Zhao, “Densetnt: End-to-end trajectory prediction from dense goal sets,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15303–15312.
[12] B. Varadarajan, A. Hefny, A. Srivastava, K. S. Refaat, N. Nayakanti, A. Cornman, K. Chen, B. Douillard, C. P. Lam, D. Anguelov et al., “Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 7814–7821.
[13] S. Shi, L. Jiang, D. Dai, and B. Schiele, “Motion transformer with global intention localization and local movement refinement,” Advances in Neural Information Processing Systems, vol. 35, pp. 6531–6543, 2022.
[14] Z. Zhou, J. Wang, Y.-H. Li, and Y.-K. Huang, “Query-centric trajectory prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17863–17873.
[15] X. Weng, B. Ivanovic, and M. Pavone, “Mtp: Multi-hypothesis tracking and prediction for reduced error propagation,” in 2022 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2022, pp. 1218–1225.
[16] X. Weng, Y. Yuan, and K. Kitani, “Ptp: Parallelized tracking and prediction with graph neural networks and diversity sampling,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 4640–4647, 2021.
[17] X. Weng, B. Ivanovic, K. Kitani, and M. Pavone, “Whose track is it anyway? improving robustness to tracking errors with affinity-based trajectory prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6573–6582.
[18] P. Zhang, L. Bai, Y. Wang, J. Fang, J. Xue, N. Zheng, and W. Ouyang, “Towards trajectory forecasting from detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[19] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in 2017 IEEE international conference on image processing (ICIP). IEEE, 2017, pp. 3645–3649.
[20] Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu, “Fairmot: On the fairness of detection and re-identification in multiple object tracking,” International Journal of Computer Vision, vol. 129, pp. 3069–3087, 2021.
[21] Z. Wang, L. Zheng, Y. Liu, Y. Li, and S. Wang, “Towards real-time multi-object tracking,” in European conference on computer vision. Springer, 2020, pp. 107–122.
[22] P. Bergmann, T. Meinhardt, and L. Leal-Taixe, “Tracking without bells and whistles,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 941–951.
[23] X. Wang, J. He, C. Fu, T. Meng, and M. Huang, “You only need two detectors to achieve multi-modal 3d multi-object tracking,” arXiv preprint arXiv:2304.08709, 2023.
[24] H. Wu, W. Han, C. Wen, X. Li, and C. Wang, “3d multi-object tracking in point clouds based on prediction confidence-guided data association,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 5668–5677, 2021.
[25] X. Wang, C. Fu, Z. Li, Y. Lai, and J. He, “Deepfusionmot: A 3d multi-object tracking framework based on camera-lidar fusion with deep association,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 8260–8267, 2022.
[26] Z. Zhou, L. Ye, J. Wang, K. Wu, and K. Lu, “Hivt: Hierarchical vector transformer for multi-agent motion prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8823–8833.
[27] R. Yu and Z. Zhou, “Towards robust human trajectory prediction in raw videos,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 8059–8066.
[28] M. Liang, B. Yang, W. Zeng, Y. Chen, R. Hu, S. Casas, and R. Urtasun, “Pnpnet: End-to-end perception and prediction with tracking in the loop,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11553–11562.
[29] S. Wang, Y. Liu, T. Wang, Y. Li, and X. Zhang, “Exploring object-centric temporal modeling for efficient multi-view 3d object detection,” arXiv preprint arXiv:2303.11926, 2023.
[30] G. Aydemir, A. K. Akan, and F. Güney, “Adapt: Efficient multi-agent trajectory prediction with adaptation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8295–8305.
[31] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4938–4947.
[32] R. Chandra, T. Guan, S. Panuganti, T. Mittal, U. Bhattacharya, A. Bera, and D. Manocha, “Forecasting trajectory and behavior of road-agents using spectral clustering in graph-lstms,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4882–4890, 2020.
[33] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11621–11631.
[34] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” Advances in neural information processing systems, vol. 28, 2015.
[35] B. Zhu, Z. Jiang, X. Zhou, Z. Li, and G. Yu, “Class-balanced grouping and sampling for point cloud 3d object detection,” arXiv preprint arXiv:1908.09492, 2019.
[36] T. Yin, X. Zhou, and P. Krahenbuhl, “Center-based 3d object detection and tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11784–11793.