Model Predictive Simulation Using Structured Graphical Models and Transformers

Xinghua Lou Google DeepMind Meet Dave Google DeepMind Shrinu Kushagra Google DeepMind Miguel Lázaro-Gredilla Google DeepMind Kevin Murphy Google DeepMind

Abstract

We propose an approach to simulating trajectories of multiple interacting agents (road users) based on transformers and probabilistic graphical models (PGMs), and apply it to the Waymo SimAgents challenge. The transformer baseline is based on the MTR model (Shi et al., 2024), which predicts multiple future trajectories conditioned on the past trajectories and static road layout features. We then improve upon these generated trajectories using a PGM whose factors encode prior knowledge, such as a preference for smooth trajectories and the avoidance of collisions with static obstacles and other moving agents. We perform (approximate) MAP inference in this PGM using the Gauss-Newton method. Finally, we sample $K=32$ trajectories for each of the $N\sim 100$ agents for the next $T=8\Delta$ time steps, where $\Delta=10$ is the sampling rate per second. Following the Model Predictive Control (MPC) paradigm, we only return the first element of our forecasted trajectories at each step, and then we replan, so that the simulation can constantly adapt to its changing environment. We therefore call our approach "Model Predictive Simulation" or MPS. We show that MPS improves upon the MTR baseline, especially in safety-critical metrics such as collision rate. Furthermore, our approach is compatible with any underlying forecasting model and does not require extra training, so we believe it is a valuable contribution to the community.

keywords:

Multi-Agent Planning, Probabilistic Graphical Model, Model Predictive Control, Transformer

1 Introduction

The use of transformers to create generative models that simulate agent trajectories, trained on large datasets such as Waymo Open Data (Ettinger et al., 2021), has become very popular in recent years. Most previous work has focused on improving the architecture (Nayakanti et al., 2023; Shi et al., 2024), the training objective (Ngiam et al., 2021; Shi et al., 2024), the trajectory representation (Seff et al., 2023; Philion et al., 2023) or the speed (Zhou et al., 2023) of these transformer-based models.

This paper tackles the problem from an orthogonal and complementary angle – namely the use of prior knowledge, encoded using a probabilistic graphical model (PGM). We perform approximate MAP inference in the PGM to “post process” the trajectory proposals from a base transformer model, to increase their realism and compliance with constraints, such as collision avoidance.

To ensure that our predicted forecasts are adaptive to the changing environment, we replan at each step, following the principle of model predictive control (MPC), which is widely used for controlling complex dynamical systems (Schwenzer et al., 2021). We therefore call our approach Model Predictive Simulation (MPS).

Our MPS approach differs from previous PGM methods for trajectory simulation, such as JFP (Luo et al., 2023), in several ways. First, we explicitly include (data-dependent) factors for collision avoidance and smooth trajectories, so we have better control over the generated trajectories. Second, our approach is iterative (being based on MPC), while JFP commits to the trajectory proposals at $t=0$ and is thus open loop. Third, our approach uses the Gauss-Newton method to compute the joint MAP estimate, whereas JFP is based on discrete belief propagation methods to choose amongst a finite set of candidate trajectories.

2 Method

Outer loop

Input: Scene context $\mathbf{c}$, num. agents $N$, num. samples $K$, trajectory length $T$
Output: Sampled trajectories, $s_{1:N}^{1:K,1:T}$
for $k=1$ to $K$ do
    $s_{1:N}^{k,0}=\text{init-trajectory}(\mathbf{c})$
    for $t=1$ to $T$ do
        Sample $r_{1:N}^{k,t}=\text{MPS}(\mathbf{c},s_{1:N}^{k,0:t-1})$
        Extend $s_{1:N}^{k,0:t}=\text{append}(s_{1:N}^{k,0:t-1},r_{1:N}^{k,t})$
    end for
end for

Algorithm 1 SimAgents outer loop

The overall simulation pseudocode is shown in Algo. 1. It generates a set of $K=32$ trajectories, each of length $T=80$, for $N$ agents given the scene context $\mathbf{c}$. (The exact value of $N$ depends on the number of agents that are visible in $\mathbf{c}$.) We denote the generated output by $s_{1:N}^{1:K,1:T}$, where $s_{i}^{k,t}=[x,y,\dot{x},\dot{y}]$ is the state (2d location and velocity) of the $i$'th agent in sample $k$ at time step $t$.
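The closed-loop structure of Algo. 1 can be illustrated with a minimal Python sketch. The helper names (`init_trajectory`, `mps_step`, `simulate`) and the dictionary-based scene context are hypothetical, and the MPS call is replaced by a constant-velocity placeholder so that the example runs on its own; it is not the actual implementation.

```python
import numpy as np

def init_trajectory(context):
    # Hypothetical helper: current state [x, y, vx, vy] of every agent, shape (N, 4).
    return np.asarray(context["initial_states"], dtype=float)

def mps_step(context, history, dt=0.1):
    # Placeholder for Algo. 2: constant-velocity extrapolation of the latest state.
    last = history[-1]
    nxt = last.copy()
    nxt[:, :2] += dt * last[:, 2:]
    return nxt

def simulate(context, num_samples=32, horizon=80, step_fn=mps_step):
    rollouts = []
    for _ in range(num_samples):                       # k = 1..K
        history = [init_trajectory(context)]           # s^{k,0}
        for _ in range(horizon):                       # t = 1..T
            # Closed loop: each step is predicted from the states generated so far.
            history.append(step_fn(context, history))
        rollouts.append(np.stack(history[1:]))         # drop the initial state
    return np.stack(rollouts)                          # shape (K, T, N, 4)

context = {"initial_states": [[0.0, 0.0, 1.0, 0.0], [5.0, 2.0, 0.0, -1.0]]}
sims = simulate(context, num_samples=4, horizon=10)
print(sims.shape)  # (4, 10, 2, 4)
```

Swapping `step_fn` for a real MPS step leaves the outer loop unchanged, which is the sense in which the approach is independent of the underlying forecasting model.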

Inner loop

Input: Scene context $\mathbf{c}$, agent history $h_{1:N}$, num. agents $N$, future planning horizon $F$, number of rollouts $J$, transformer proposal $\pi$
Output: Predicted next state for each agent, $r_{1:N}$
for $j=1$ to $J$ do
    Sample $(a_{1:N}^{j,1:F},g_{1:N}^{j})\sim\pi(\mathbf{c},h_{1:N},F)$
    $G^{j}=\text{BuildFactorGraph}(a_{1:N}^{j,1:F},g_{1:N}^{j},\mathbf{c})$
    Initialize $s_{1:N}^{j,1:F}=a_{1:N}^{j,1:F}$
    $(s_{1:N}^{j,1:F},E^{j})=\text{Inference}(G^{j},s_{1:N}^{j,1:F})$
end for
Sample $j^{*}\sim\text{SoftMin}(E^{1:J})$
return $s_{1:N}^{j^{*},1}$

Algorithm 2 Model Predictive Simulation

At each step $t$, the simulator calls our MPS algorithm to generate a prediction for the next state of each agent. The pseudocode for this is shown in Algo. 2. The approach is as follows. First we use the MTR transformer model $\pi$ (Shi et al., 2024) to sample a set of $N$ goal locations, $g_{1:N}^{j}$, one for each agent, as well as a sequence of anchor points leading to each goal, $a_{1:N}^{j,1:F}$, where $F$ is the planning or forecast horizon. We do this $J=60$ times in parallel, to create a set of possible futures. We then use the PGM to generate $J$ joint trajectories (for all $N$ agents), using the method described below. Finally we evaluate the energy of each generated trajectory, $E^{j}$, sample one of the low energy (high probability) ones to get $s_{1:N}^{j^{*},1:F}$, and return the first step of this sampled trajectory, $s_{1:N}^{j^{*},1}$.
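The selection step of Algo. 2 can be sketched as follows. This is an illustration under stated assumptions: the transformer proposals are replaced by random arrays, `toy_inference` is a hypothetical stand-in for PGM inference, and the SoftMin is taken to mean sampling with probability proportional to $\exp(-E^{j}/\tau)$ with a temperature $\tau$ that the text does not specify.

```python
import numpy as np

def softmin_sample(energies, rng, temperature=1.0):
    """Sample an index with probability proportional to exp(-E / temperature)."""
    e = np.asarray(energies, dtype=float)
    weights = np.exp(-(e - e.min()) / temperature)   # shift by the min for stability
    probs = weights / weights.sum()
    return int(rng.choice(len(e), p=probs)), probs

def toy_inference(anchors):
    """Stand-in for PGM inference: keep the anchors, score them by velocity changes."""
    energy = float(np.sum(np.diff(anchors[..., 2:], axis=0) ** 2))
    return anchors, energy

def mps_step(proposals, rng, inference=toy_inference):
    """proposals: J arrays of shape (F, N, 4), standing in for samples from pi."""
    refined, energies = zip(*(inference(a) for a in proposals))
    j_star, _ = softmin_sample(energies, rng)
    return refined[j_star][0]   # only the first step is returned, shape (N, 4)

rng = np.random.default_rng(0)
proposals = [rng.normal(size=(8, 3, 4)) for _ in range(5)]   # J=5, F=8, N=3
next_state = mps_step(proposals, rng)
print(next_state.shape)  # (3, 4)
```

Sampling from the SoftMin rather than taking the arg-min keeps some diversity across the $K$ rollouts while still favouring low-energy joint trajectories.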

Graphical model

Figure 1: Factor Graph for $N=2$ agents unrolled for $T$ planning steps. Circles are random variables, gray squares are fixed factors.

$$
\begin{aligned}
p(\mathbf{s}_{1:N}^{1:F}\mid\mathbf{c},\mathbf{a}_{1:N}^{1:F},g_{1:N})\propto{}&\prod_{i=1}^{N}\left[f_{\mathrm{G}}(s_{i}^{F}\mid g_{i})\cdot\prod_{t=1}^{F-1}f_{\mathrm{M}}(s_{i}^{t}\mid a_{i}^{t})\right]\cdot\\
&\prod_{i=1}^{N}\left[\prod_{t=2}^{F}f_{\mathrm{L}}(s_{i}^{t-1},s_{i}^{t})\cdot f_{\mathrm{A}}(s_{i}^{t-1},s_{i}^{t})\right]\cdot\\
&\prod_{i=1}^{N}\prod_{t=1}^{F}f_{\mathrm{O}}(s_{i}^{t}\mid\mathbf{c})\cdot\prod_{i\neq j}\prod_{t=1}^{F}f_{\mathrm{C}}(s_{i}^{t},s_{j}^{t})
\end{aligned}
$$

Figure 2: Joint probability model.

The key to our method is the probabilistic graphical model (PGM) that improves upon the trajectories proposed by MTR. The factor graph is shown in Fig. 1 and the corresponding conditional joint distribution is given in Fig. 2. The model was inspired by Patwardhan et al. (2022), who use Gaussian belief propagation. We now explain each of the factors.

First we have factors which compare a candidate trajectory to the original proposal. The motion factor is defined as $f_{\mathrm{M}}(s_{i}^{t}\mid a_{i}^{t})=|s_{i}^{t}-a_{i}^{t}|$, where $a_{i}^{t}$ is the predicted location (anchor point) for agent $i$ at time $t$ as computed by $\pi$. This ensures the trajectory stays close to the initial proposal. The proximity to goal factor is defined as $f_{\mathrm{G}}(s_{i}^{F}\mid g_{i})=|s_{i}^{F}-g_{i}|$, where $g_{i}$ is the goal for agent $i$ predicted by $\pi$. This ensures the trajectory ends close to where we expect.

Second we have factors defined from "physics". We define a factor that penalizes deviation from linear motion: $f_{\mathrm{L}}(s,s^{\prime})=|s^{\prime}_{xy}-(s_{xy}+s_{\dot{x}\dot{y}}\Delta_{t})|$, where $s_{xy}$ are the location components of $s$, $s_{\dot{x}\dot{y}}$ are the velocity components of $s$, and $\Delta_{t}$ is the sampling interval (0.1s at the 10Hz sampling rate). We also define a factor that penalizes changes in velocity, including changes in direction: $f_{\mathrm{A}}(s,s^{\prime})=|s_{\dot{x}\dot{y}}-s^{\prime}_{\dot{x}\dot{y}}|$. We used weight 2.0 for $f_{\mathrm{A}}$ and 1.0 for all other factors.
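As a rough illustration of the proposal and physics factors, the sketch below evaluates them for a single agent with state $[x,y,\dot{x},\dot{y}]$. It assumes that $|\cdot|$ denotes the Euclidean norm, that the motion factor compares positions only, and that the time step is 0.1s; the function names are hypothetical.

```python
import numpy as np

DT = 0.1  # assumed sampling interval in seconds (10 Hz)

def motion_factor(s, anchor):
    # f_M: distance between the state's position and its anchor point.
    return float(np.linalg.norm(s[:2] - anchor))

def linear_motion_factor(s, s_next, dt=DT):
    # f_L: next position minus the constant-velocity prediction from s.
    return float(np.linalg.norm(s_next[:2] - (s[:2] + s[2:] * dt)))

def velocity_change_factor(s, s_next):
    # f_A: change in the velocity components between consecutive states.
    return float(np.linalg.norm(s[2:] - s_next[2:]))

s = np.array([0.0, 0.0, 10.0, 0.0])          # moving along +x at 10 m/s
straight = np.array([1.0, 0.0, 10.0, 0.0])   # consistent with constant velocity
swerve = np.array([1.0, 0.3, 8.0, 6.0])      # lateral jump and a sharp turn

for name, nxt in [("straight", straight), ("swerve", swerve)]:
    f_l = linear_motion_factor(s, nxt)
    f_a = velocity_change_factor(s, nxt)
    # Weights from the text: 2.0 for f_A, 1.0 for the other factors.
    print(name, round(f_l, 4), round(f_a, 4), round(1.0 * f_l + 2.0 * f_a, 4))
```

The straight continuation incurs no penalty from either physics factor, whereas the swerve is penalized by both, with the velocity-change term dominating because of its larger weight.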

Third we have factors derived from static obstacles on the road: $f_{\mathrm{O}}(s\mid\mathbf{c})=\max_{(x,y)\in\mathbf{c}_{\mathrm{RE}}}\mathrm{G}(x,y\mid s)$, where $\mathbf{c}_{\mathrm{RE}}$ represents the coordinates of the road edges (part of the context $\mathbf{c}$) and $\mathrm{G}(x,y\mid s)$ is a Gaussian field centered and rotated according to the agent's location $s_{xy}$.

Finally, we have pairwise collision factors between agents: $f_{\mathrm{C}}(s,s^{\prime})=\max_{(x,y)\in\mathrm{CCP}(s^{\prime})}\mathrm{G}(x,y\mid s)$, where $\mathrm{G}(x,y\mid s)$ is a Gaussian field for agent $s$, and $\mathrm{CCP}(s^{\prime})$ are the 9 collision checking points (CCP) for the other agent $s^{\prime}$ (4 corners, 4 centers of the sides, and center of the agent).
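A possible form of the collision factor is sketched below. Several details are assumptions rather than specifics from the text: each agent is described by a center, heading, length and width; the Gaussian field is axis-aligned in the agent's body frame with standard deviations equal to half the agent's length and width; and the 9 collision checking points are laid out as a 3x3 grid over the bounding box.

```python
import numpy as np

def rotation(heading):
    c, s = np.cos(heading), np.sin(heading)
    return np.array([[c, -s], [s, c]])

def collision_check_points(center, heading, length, width):
    # 3x3 grid in the body frame: 4 corners, 4 side midpoints, and the center.
    xs = np.array([-0.5, 0.0, 0.5]) * length
    ys = np.array([-0.5, 0.0, 0.5]) * width
    local = np.array([[x, y] for x in xs for y in ys])
    return center + local @ rotation(heading).T        # body frame -> world frame

def gaussian_field(points, center, heading, sigma_long, sigma_lat):
    local = (points - center) @ rotation(heading)      # world frame -> body frame
    z = (local[:, 0] / sigma_long) ** 2 + (local[:, 1] / sigma_lat) ** 2
    return np.exp(-0.5 * z)

def collision_factor(agent, other):
    # f_C: largest field value of `agent` over the check points of `other`.
    pts = collision_check_points(other["center"], other["heading"],
                                 other["length"], other["width"])
    field = gaussian_field(pts, agent["center"], agent["heading"],
                           agent["length"] / 2, agent["width"] / 2)
    return float(field.max())

def make_agent(x, y, heading=0.0, length=4.5, width=2.0):
    return {"center": np.array([x, y]), "heading": heading,
            "length": length, "width": width}

ego = make_agent(0.0, 0.0)
print(collision_factor(ego, make_agent(3.0, 0.5, heading=np.pi / 2)))  # nearby agent
print(collision_factor(ego, make_agent(20.0, 0.0)))                    # distant agent
```

Under the same assumptions, the static obstacle factor $f_{\mathrm{O}}$ would reuse `gaussian_field`, with the road-edge coordinates $\mathbf{c}_{\mathrm{RE}}$ taking the place of the collision checking points.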

Inference

Inference on the factor graph is equivalent to minimizing a non-linear, non-convex quadratic optimization problem defined over $\mathbf{s}_{1:N}^{1:F}$ . For efficiency reasons, we developed a two-step approach. First, we use the Gauss–Newton method to solve a partial model that only consists of $f_{\mathrm{M}}$ , $f_{\mathrm{L}}$ and $f_{\mathrm{A}}$ factors, as these can all be evaluated in parallel across agents using $N$ individual trajectory models. This step produces smoothed trajectories, which are then frozen. Second, we sample joint trajectories for agents according to their probability (unnormalized energy), and use the $f_{\mathrm{O}}$ and $f_{\mathrm{C}}$ factors to score their quality. After repeating this $J$ times, the best joint trajectories are sampled from a softmin operation over the scores of the $J$ samples.
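The first inference step (per-agent smoothing with $f_{\mathrm{M}}$, $f_{\mathrm{L}}$ and $f_{\mathrm{A}}$) can be illustrated with a small Gauss-Newton loop. This sketch uses NumPy with a finite-difference Jacobian instead of JAX and JAXopt, and it assumes that the energy is half the sum of weighted squared residuals, with each weight applied to the squared term; none of these choices is confirmed by the text.

```python
import numpy as np

def residuals(flat, anchors, dt=0.1, w_motion=1.0, w_linear=1.0, w_accel=2.0):
    s = flat.reshape(-1, 4)                              # (F, 4): x, y, vx, vy
    r_m = s[:, :2] - anchors                             # f_M: stay near the anchors
    r_l = s[1:, :2] - (s[:-1, :2] + dt * s[:-1, 2:])     # f_L: constant-velocity motion
    r_a = s[1:, 2:] - s[:-1, 2:]                         # f_A: small velocity changes
    return np.concatenate([np.sqrt(w_motion) * r_m.ravel(),
                           np.sqrt(w_linear) * r_l.ravel(),
                           np.sqrt(w_accel) * r_a.ravel()])

def numerical_jacobian(fn, x, eps=1e-6):
    r0 = fn(x)
    jac = np.zeros((r0.size, x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        jac[:, i] = (fn(x + step) - r0) / eps
    return jac

def gauss_newton(fn, x0, iters=5):
    x = x0.copy()
    for _ in range(iters):
        r, jac = fn(x), numerical_jacobian(fn, x)
        # Least-squares solution of J delta = -r, i.e. the Gauss-Newton step.
        delta, *_ = np.linalg.lstsq(jac, -r, rcond=None)
        x = x + delta
    return x

rng = np.random.default_rng(0)
F, dt = 20, 0.1
t = np.arange(F) * dt
# Noisy anchor points along a straight line, standing in for a transformer proposal.
anchors = np.stack([10.0 * t, np.zeros(F)], axis=1) + rng.normal(scale=0.2, size=(F, 2))
x0 = np.concatenate([anchors, np.gradient(anchors, dt, axis=0)], axis=1).ravel()

fn = lambda x: residuals(x, anchors, dt)
energy = lambda x: 0.5 * float(np.sum(fn(x) ** 2))
x_opt = gauss_newton(fn, x0)
print(energy(x0), energy(x_opt))   # the smoothed trajectory has lower energy
```

Because each agent's residuals in this step depend only on that agent's own states, the same solve can be run independently for all $N$ agents. In this toy version the residuals are linear in the state, so the first Gauss-Newton step already reaches the minimum; the loop is kept to show the general iteration.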

	META METRIC	KINEMATIC				INTERACTIVE			MAP		
WAYMO LEADERBOARD	REALISM	LINEAR SPEED	LINEAR ACCEL.	ANG. SPEED	ANG. ACCEL.	DIST. TO OBJ.	COLLISION	TTC	DIST. TO ROAD	OFFROAD	minADE $\downarrow$
SMART	0.7511	0.3646	0.4057	0.4231	0.5844	0.3769	0.9655	0.8317	0.6590	0.9362	1.5447
MVTE	0.7301	0.3506	0.3530	0.4974	0.5999	0.3742	0.9049	0.8309	0.6655	0.9071	1.6769
MPS (Ours)	0.7416	0.3137	0.3049	0.4705	0.5834	0.3593	0.9629	0.8070	0.6651	0.9366	1.4841

Table 1: WOSAC Leaderboard: SMART (2024 winner) vs. MVTE (2023 winner) vs. MPS (ours).

3 Experimental Evaluation

Benchmark We evaluated MPS on the 2024 Waymo Open Sim Agents Challenge (Montali et al., 2023), where the task is to simulate 32 realistic rollouts of all agents in a scene for 8s into the future, given 1s of history. The simulation needs to be closed-loop and factorized between the ADV and other agents, which MPS satisfies naturally.

Implementation Details We implemented the factors and the inference in JAX (https://github.com/google/jax), using JAXopt (https://jaxopt.github.io/) for the Gauss-Newton method. We leveraged JAX's just-in-time (JIT) compilation and observed great scalability. To speed up simulation, we take the next 10 steps of the selected trajectory at each MPS iteration, rather than a single step.

We trained our own MTR model $\pi$ using the open source code (https://github.com/sshaoshuai/MTR). We removed local attention and reduced the source polylines to 512. The training data is augmented by adding extra interacting agents, and by applying random history dropouts. We followed the original training setup except for the number of epochs (50), the batch size (8) and the LR schedule ([25, 30, 35, 40, 45]). Training took about 3 days on 16 A100s.

We used only the official Waymo Open Motion Dataset v1.2.1 and did not use any Lidar or Camera data. We did not need any additional training and we did not use ensembles.

Sim Agents 2024 Results We ranked number 4 among all methods (Table 1). We outperformed the 2023 winner MVTE (Wang et al., 2023), which also uses MTR (Shi et al., 2024), and are approximately 1 point (0.0095 in REALISM) behind the 2024 winner SMART (Wu et al., 2024). MPS achieved near-top performance in a few safety-critical metrics such as COLLISION and OFFROAD, showing the effectiveness of the priors in our model. MPS underperformed on LINEAR SPEED / ACCEL. We speculate this is because MPS can generate diverse rollouts that are very different from the logged data used for metric evaluation.

	META METRIC	KINEMATIC				INTERACTIVE			MAP		
METHOD	REALISM	LINEAR SPEED	LINEAR ACCEL.	ANG. SPEED	ANG. ACCEL.	DIST. TO OBJ.	COLLISION	TTC	DIST. TO ROAD	OFFROAD	minADE $\downarrow$
MTR+RAND	0.7019	0.3922	0.3530	0.3899	0.3304	0.3691	0.8491	0.8164	0.6706	0.9207	1.3084
MPS	0.7418	0.3158	0.3056	0.4664	0.5818	0.3604	0.9617	0.8094	0.6651	0.9374	1.4841

Table 2: Ablation study – comparing MPS to MTR with random trajectory sampling.

Ablation Study To evaluate the value of the PGM priors, we compare MPS to the same MTR model with random trajectory sampling (MTR+RAND) on the validation dataset. As shown in Table 2, MPS improved safety-critical metrics such as COLLISION and OFFROAD, as well as the overall REALISM score, while underperforming on LINEAR SPEED / ACCEL. for the same reason discussed above.
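The size of these differences can be read off Table 2 directly. The short script below copies the relevant scores from the table and prints MPS minus MTR+RAND for each metric, treating a higher score as better, which matches how the text describes the COLLISION and OFFROAD changes as improvements.

```python
# Scores copied from Table 2 (validation set).
mtr_rand = {"REALISM": 0.7019, "LINEAR SPEED": 0.3922, "LINEAR ACCEL.": 0.3530,
            "COLLISION": 0.8491, "OFFROAD": 0.9207}
mps = {"REALISM": 0.7418, "LINEAR SPEED": 0.3158, "LINEAR ACCEL.": 0.3056,
       "COLLISION": 0.9617, "OFFROAD": 0.9374}

# Positive values mean MPS scores higher than MTR+RAND on that metric.
deltas = {name: round(mps[name] - mtr_rand[name], 4) for name in mps}
for name, delta in deltas.items():
    print(f"{name:14s} {delta:+.4f}")
```

The largest gain is in COLLISION (+0.1126), while LINEAR SPEED (-0.0764) and LINEAR ACCEL. (-0.0474) account for the losses.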

Qualitative Study Qualitatively, MPS generates diverse (multi-modal) predictions (Fig. 3), and each prediction contains realistic traffic patterns such as lane merging, unprotected left turns, and yielding, among others (Fig. 4).

4 Conclusion

We explored an approach that can improve on any trajectory simulation model by adding domain-specific priors, and performing inference in the corresponding PGM. We believe combining prior-driven (top-down) and data-driven (bottom-up) methods is key to building robust and reliable autonomous driving planning and simulation.

We thank Joseph Ortiz and Wolfgang Lehrach for many useful discussions and suggestions.

References

Ettinger et al. (2021) S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. R. Qi, Y. Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. In ICCV, 2021.
Luo et al. (2023) W. Luo, C. Park, A. Cornman, B. Sapp, and D. Anguelov. JFP: Joint future prediction with interactive multi-agent modeling for autonomous driving. In CoRL, 2023.
Montali et al. (2023) N. Montali, J. Lambert, P. Mougin, A. Kuefler, N. Rhinehart, M. Li, C. Gulino, T. Emrich, Z. Yang, S. Whiteson, B. White, and D. Anguelov. The Waymo Open Sim Agents Challenge. In NIPS Datasets Track, May 2023.
Nayakanti et al. (2023) N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. In ICRA. IEEE, 2023.
Ngiam et al. (2021) J. Ngiam, B. Caine, V. Vasudevan, Z. Zhang, H.-T. L. Chiang, J. Ling, R. Roelofs, A. Bewley, C. Liu, A. Venugopal, et al. Scene Transformer: A unified architecture for predicting multiple agent trajectories. arXiv preprint arXiv:2106.08417, 2021.
Patwardhan et al. (2022) A. Patwardhan, R. Murai, and A. J. Davison. Distributing collaborative multi-robot planning with Gaussian belief propagation. In IEEE Robotics and Automation Letters, Mar. 2022.
Philion et al. (2023) J. Philion, X. B. Peng, and S. Fidler. Trajeglish: Learning the language of driving scenarios. arXiv preprint arXiv:2312.04535, 2023.
Schwenzer et al. (2021) M. Schwenzer, M. Ay, T. Bergs, and D. Abel. Review on model predictive control: An engineering perspective. The International Journal of Advanced Manufacturing Technology, 117(5):1327–1349, 2021.
Seff et al. (2023) A. Seff, B. Cera, D. Chen, M. Ng, A. Zhou, N. Nayakanti, K. S. Refaat, R. Al-Rfou, and B. Sapp. MotionLM: Multi-agent motion forecasting as language modeling. In ICCV, 2023.
Shi et al. (2024) S. Shi, L. Jiang, D. Dai, and B. Schiele. MTR++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
Wang et al. (2023) Y. Wang, T. Zhao, and F. Yi. Multiverse Transformer: 1st place solution for Waymo Open Sim Agents Challenge 2023. arXiv preprint arXiv:2306.11868, 2023.
Wu et al. (2024) W. Wu, X. Feng, Z. Gao, and Y. Kan. SMART: Scalable multi-agent real-time simulation via next-token prediction. arXiv preprint arXiv:2405.15677, 2024.
Zhou et al. (2023) Z. Zhou, J. Wang, Y.-H. Li, and Y.-K. Huang. Query-centric trajectory prediction. In CVPR, 2023.

Appendix

	mAP	minADE	minFDE	MissRate
Vehicle	0.4745	0.7526	1.5107	0.1489
Pedestrian	0.4827	0.3455	0.7251	0.0756
Cyclist	0.3898	0.7095	1.4299	0.1865
Avg	0.4490	0.6025	1.2219	0.1370

Table 3: Evaluation of our MTR model on the WOMD Motion Prediction dataset (validation).