
Model Predictive Simulation Using Structured Graphical Models and Transformers

Xinghua Lou (Google DeepMind), Meet Dave (Google DeepMind), Shrinu Kushagra (Google DeepMind), Miguel Lázaro-Gredilla (Google DeepMind), Kevin Murphy (Google DeepMind)
Abstract

We propose an approach to simulating trajectories of multiple interacting agents (road users) based on transformers and probabilistic graphical models (PGMs), and apply it to the Waymo SimAgents challenge. The transformer baseline is based on the MTR model (Shi et al., 2024), which predicts multiple future trajectories conditioned on the past trajectories and static road layout features. We then improve upon these generated trajectories using a PGM whose factors encode prior knowledge, such as a preference for smooth trajectories and avoidance of collisions with static obstacles and other moving agents. We perform (approximate) MAP inference in this PGM using the Gauss-Newton method. Finally, we sample $K=32$ trajectories for each of the $N \sim 100$ agents for the next $T=8\Delta$ time steps, where $\Delta=10$ is the sampling rate per second. Following the Model Predictive Control (MPC) paradigm, we only return the first element of our forecasted trajectories at each step, and then we replan, so that the simulation can constantly adapt to its changing environment. We therefore call our approach "Model Predictive Simulation" or MPS. We show that MPS improves upon the MTR baseline, especially in safety-critical metrics such as collision rate. Furthermore, our approach is compatible with any underlying forecasting model and does not require extra training, so we believe it is a valuable contribution to the community.

keywords:
Multi-Agent Planning, Probabilistic Graphical Model, Model Predictive Control, Transformer

1 Introduction

The use of transformers to create generative models that simulate agent trajectories, trained on large datasets such as Waymo Open Data (Ettinger et al., 2021), has become very popular in recent years. Most previous work has focused on improving the architecture (Nayakanti et al., 2023; Shi et al., 2024), the training objective (Ngiam et al., 2021; Shi et al., 2024), the trajectory representation (Seff et al., 2023; Philion et al., 2023) or the speed (Zhou et al., 2023) of these transformer-based models.

This paper tackles the problem from an orthogonal and complementary angle – namely the use of prior knowledge, encoded using a probabilistic graphical model (PGM). We perform approximate MAP inference in the PGM to “post process” the trajectory proposals from a base transformer model, to increase their realism and compliance with constraints, such as collision avoidance.

To ensure that our predicted forecasts are adaptive to the changing environment, we replan at each step, following the principle of model predictive control (MPC), which is widely used for controlling complex dynamical systems (Schwenzer et al., 2021). We therefore call our approach Model Predictive Simulation (MPS).

Our MPS approach differs from previous PGM methods for trajectory simulation, such as JFP (Luo et al., 2023), in several ways. First, we explicitly include (data-dependent) factors for collision avoidance and smooth trajectories, so we have better control over the generated trajectories. Second, our approach is iterative (being based on MPC), while JFP commits to the trajectory proposals at $t=0$ and is thus open loop. Third, our approach uses the Gauss-Newton method to compute the joint MAP estimate, whereas JFP is based on discrete belief propagation methods to choose amongst a finite set of candidate trajectories.

2 Method

Outer loop

Input: Scene context $\mathbf{c}$, num. agents $N$, num. samples $K$, trajectory length $T$
Output: Sampled trajectories, $s_{1:N}^{1:K,1:T}$
for $k=1$ to $K$ do
     $s_{1:N}^{k,0} = \text{init-trajectory}(\mathbf{c})$
     for $t=1$ to $T$ do
          Sample $r_{1:N}^{k,t} = \text{MPS}(\mathbf{c}, s_{1:N}^{k,1:t-1})$
          Extend $s_{1:N}^{k,1:t} = \text{append}(s_{1:N}^{k,1:t-1}, r_{1:N}^{k,t})$
     end for
end for
Algorithm 1 SimAgents outer loop

The overall simulation pseudocode is shown in Algo. 1. It generates a set of $K=32$ trajectories, each of length $T=80$, for $N$ agents given the scene context $\mathbf{c}$. (The exact value of $N$ depends on the number of agents that are visible in $\mathbf{c}$.) We denote the generated output by $s_{1:N}^{1:K,1:T}$, where $s_i^{k,t}=[x,y,\dot{x},\dot{y}]$ is the state (2d location and velocity) of the $i$'th agent in sample $k$.
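The outer loop of Algo. 1 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: `constant_velocity_step` is a hypothetical stand-in for the MPS call (it only advances each agent at constant velocity), and the step size `DT = 0.1` s is an assumption derived from the 10 Hz sampling rate.

```python
from typing import Callable, List, Tuple

State = Tuple[float, float, float, float]  # (x, y, x_dot, y_dot)
DT = 0.1  # assumed step size in seconds (10 Hz sampling)


def constant_velocity_step(history: List[List[State]]) -> List[State]:
    """Hypothetical stand-in for MPS: advance each agent linearly."""
    last = history[-1]
    return [(x + vx * DT, y + vy * DT, vx, vy) for (x, y, vx, vy) in last]


def simulate(initial: List[State], num_samples: int, horizon: int,
             mps_step: Callable[[List[List[State]]], List[State]]
             ) -> List[List[List[State]]]:
    """Return rollouts indexed as [sample k][time t][agent i]."""
    rollouts = []
    for _ in range(num_samples):
        history = [list(initial)]            # initial state from the scene context
        for _ in range(horizon):
            next_states = mps_step(history)  # replan and keep a single step
            history.append(next_states)      # extend the simulated trajectory
        rollouts.append(history[1:])         # drop the initial state
    return rollouts


agents = [(0.0, 0.0, 1.0, 0.0), (5.0, 0.0, 0.0, 2.0)]
out = simulate(agents, num_samples=32, horizon=80,
               mps_step=constant_velocity_step)
print(len(out), len(out[0]), len(out[0][0]))
```

Because the stand-in is deterministic, all 32 rollouts coincide here; in the actual method the diversity comes from sampling inside MPS.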

Inner loop

Input: Scene context $\mathbf{c}$, agent history $h_{1:N}$, num. agents $N$, future planning horizon $F$, number of rollouts $J$, transformer proposal $\pi$
Output: Predicted next state for each agent, $r_{1:N}$
for $j=1$ to $J$ do
     Sample $(a_{1:N}^{j,1:F}, g_{1:N}^{j}) \sim \pi(\mathbf{c}, h_{1:N}, F)$
     $G^j = \text{BuildFactorGraph}(a_{1:N}^{j,1:F}, g_{1:N}^{j}, \mathbf{c})$
     Initialize $s_{1:N}^{j,1:F} = a_{1:N}^{j,1:F}$
     $(s_{1:N}^{j,1:F}, E^j) = \text{Inference}(G^j, s_{1:N}^{j,1:F})$
end for
Sample $j^* \sim \text{SoftMin}(E^{1:J})$
return $s_{1:N}^{j^*,1}$
Algorithm 2 Model Predictive Simulation

At each step $t$, the simulator calls our MPS algorithm to generate a prediction for the next state of each agent. The pseudocode for this is shown in Algo. 2. The approach is as follows. First, we use the MTR transformer model $\pi$ (Shi et al., 2024) to sample a set of $N$ goal locations, $g_{1:N}^{j}$, one for each agent, as well as a sequence of anchor points leading to each goal, $a_{1:N}^{j,1:F}$, where $F$ is the planning or forecast horizon. We do this $J=60$ times in parallel, to create a set of possible futures. We then use the PGM to generate $J$ joint trajectories (for all $N$ agents), using the method described below. Finally, we evaluate the energy of each generated trajectory, $E^j$, sample one of the low energy (high probability) ones to get $s_{1:N}^{j^*,1:F}$, and return the first step of this sampled trajectory, $s_{1:N}^{j^*,1}$.
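The final selection step of Algo. 2 draws one rollout index with a probability that decreases with its energy. A minimal sketch, assuming the standard softmin form $p(j) \propto \exp(-E^j/\tau)$; the temperature `tau` is a hypothetical parameter that the paper does not specify.

```python
import math
import random
from typing import List, Sequence


def softmin_probs(energies: Sequence[float], tau: float = 1.0) -> List[float]:
    """Softmin weights: lower energy gets higher probability."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - e_min) / tau) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def sample_rollout(energies: Sequence[float], rng: random.Random,
                   tau: float = 1.0) -> int:
    """Sample the index of one rollout according to the softmin weights."""
    probs = softmin_probs(energies, tau)
    return rng.choices(range(len(energies)), weights=probs, k=1)[0]


energies = [3.0, 1.0, 2.0]  # toy energies for J = 3 rollouts
probs = softmin_probs(energies)
j_star = sample_rollout(energies, random.Random(0))
```

Sampling rather than taking the arg-min keeps some diversity across the simulated rollouts while still favoring low-energy candidates.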

Graphical model

Figure 1: Factor graph for $N=2$ agents unrolled for $T$ planning steps. Circles are random variables, gray squares are fixed factors.
$$
\begin{aligned}
p(\mathbf{s}_{1:N}^{1:F} \mid \mathbf{c}, \mathbf{a}_{1:N}^{1:F}, g_{1:N}) \propto{} & \prod_{i=1}^{N}\left[ f_G(s_i^F \mid g_i) \cdot \prod_{t=1}^{F-1} f_M(s_i^t \mid a_i^t) \right] \cdot \\
& \prod_{i=1}^{N}\left[ \prod_{t=2}^{F} f_L(s_i^{t-1}, s_i^t) \cdot f_A(s_i^{t-1}, s_i^t) \right] \cdot \\
& \prod_{i=1}^{N} \prod_{t=1}^{F} f_O(s_i^t \mid \mathbf{c}) \cdot \prod_{i \neq j} \prod_{t=1}^{F} f_C(s_i^t, s_j^t)
\end{aligned}
$$
Figure 2: Joint probability model.

The key to our method is the probabilistic graphical model (PGM) for improving upon the trajectories proposed by MTR. The factor graph is shown in Fig. 1 and the corresponding conditional joint distribution is given in Fig. 2. The model was inspired by Patwardhan et al. (2022), who use Gaussian belief propagation. We now explain each of the factors.

First, we have factors which compare a candidate trajectory to the original proposal. The motion factor is defined as $f_{\mathrm{M}}(s_i^t) = |s_i^t - a_i^t|$, where $a_i^t$ is the predicted location (anchor point) for agent $i$ at time $t$ as computed by $\pi$. This ensures the trajectory stays close to the initial proposal. The proximity to goal factor is defined as $f_{\mathrm{G}}(s_i^F) = |s_i^F - g_i|$, where $g_i$ is the goal for agent $i$ predicted by $\pi$. This ensures the trajectory ends close to where we expect.

Second, we have factors defined from "physics". We define a factor that penalizes deviation from linear motion: $f_{\mathrm{L}}(s,s') = |s'_{xy} - (s_{xy} + s_{\dot{x}\dot{y}}\Delta_t)|$, where $s_{xy}$ are the location components of $s$, $s_{\dot{x}\dot{y}}$ are the velocity components of $s$, and $\Delta_t$ is the sampling interval (the time between consecutive steps). We also define a factor that penalizes changes in velocity, including changes in direction: $f_{\mathrm{A}}(s,s') = |s_{\dot{x}\dot{y}} - s'_{\dot{x}\dot{y}}|$. We used weight 2.0 for $f_{\mathrm{A}}$ and 1.0 for all other factors.
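The per-agent factors defined so far can be evaluated with a few lines of Python. This is a sketch under stated assumptions: the factors are read as residual norms that enter the energy as weighted squares (the paper gives the weights but not the exact form), $f_{\mathrm{M}}$ and $f_{\mathrm{G}}$ compare location components only, and `DT`, `f_point`, and `agent_energy` are illustrative names.

```python
import math
from typing import List, Sequence, Tuple

State = Tuple[float, float, float, float]  # (x, y, x_dot, y_dot)
Point = Tuple[float, float]
DT = 0.1  # assumed time between steps, in seconds


def norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(c * c for c in v))


def f_point(s: State, target: Point) -> float:
    """Shared form of f_M and f_G: distance from the location to a target."""
    return norm((s[0] - target[0], s[1] - target[1]))


def f_linear(s: State, s_next: State) -> float:
    """f_L: deviation of the next location from constant-velocity motion."""
    return norm((s_next[0] - (s[0] + s[2] * DT),
                 s_next[1] - (s[1] + s[3] * DT)))


def f_accel(s: State, s_next: State) -> float:
    """f_A: change in velocity between consecutive states."""
    return norm((s[2] - s_next[2], s[3] - s_next[3]))


def agent_energy(traj: List[State], anchors: List[Point], goal: Point,
                 w_accel: float = 2.0) -> float:
    """Assumed energy: weighted sum of squared factor values for one agent."""
    e = f_point(traj[-1], goal) ** 2                       # f_G at step F
    e += sum(f_point(s, a) ** 2                            # f_M for t < F
             for s, a in zip(traj[:-1], anchors[:-1]))
    for prev, cur in zip(traj, traj[1:]):                  # f_L, f_A for t >= 2
        e += f_linear(prev, cur) ** 2 + w_accel * f_accel(prev, cur) ** 2
    return e


anchors = [(0.1, 0.0), (0.2, 0.0), (0.3, 0.0)]
smooth = [(0.1, 0.0, 1.0, 0.0), (0.2, 0.0, 1.0, 0.0), (0.3, 0.0, 1.0, 0.0)]
swerve = [smooth[0], (0.2, 0.5, 1.0, 0.0), smooth[2]]
e_smooth = agent_energy(smooth, anchors, goal=(0.3, 0.0))
e_swerve = agent_energy(swerve, anchors, goal=(0.3, 0.0))
```

The constant-velocity trajectory that follows its anchors has essentially zero energy, while the trajectory that swerves sideways at the middle step is penalized by both the motion and the linear-motion terms.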

Third, we have factors derived from static obstacles on the road: $f_{\mathrm{O}}(s \mid \mathbf{c}) = \max_{(x,y) \in \mathbf{c}_{\mathrm{RE}}} \mathrm{G}(x,y \mid s)$, where $\mathbf{c}_{\mathrm{RE}}$ represents the coordinates of the road edges (part of the context $\mathbf{c}$) and $\mathrm{G}(x,y \mid s)$ is a Gaussian field centered and rotated according to the agent's location $s_{xy}$.

Finally, we have pairwise collision factors between agents: $f_{\mathrm{C}}(s,s') = \max_{(x,y) \in \mathrm{CCP}(s')} \mathrm{G}(x,y \mid s)$, where $\mathrm{G}(x,y \mid s)$ is a Gaussian field for agent $s$, and $\mathrm{CCP}(s')$ are the 9 collision checking points (CCP) for the other agent $s'$ (4 corners, 4 centers of the sides, and center of the agent).
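The collision factor can be illustrated with a small sketch. Several details are assumptions rather than facts from the paper: the agent's heading is passed explicitly, the field is an unnormalized anisotropic Gaussian, and the box size and the `sigma_long` and `sigma_lat` values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Pose = Tuple[float, float, float]  # (center x, center y, heading in radians)


def collision_check_points(pose: Pose, length: float,
                           width: float) -> List[Point]:
    """9 points on the agent's box: 4 corners, 4 side centers, the center."""
    cx, cy, heading = pose
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    pts = []
    for dx in (-0.5 * length, 0.0, 0.5 * length):
        for dy in (-0.5 * width, 0.0, 0.5 * width):
            pts.append((cx + dx * cos_h - dy * sin_h,
                        cy + dx * sin_h + dy * cos_h))
    return pts


def gaussian_field(x: float, y: float, pose: Pose,
                   sigma_long: float, sigma_lat: float) -> float:
    """Unnormalized Gaussian centered on the agent, aligned with its heading."""
    cx, cy, heading = pose
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    dx, dy = x - cx, y - cy
    lon = dx * cos_h + dy * sin_h    # offset along the heading
    lat = -dx * sin_h + dy * cos_h   # offset across the heading
    return math.exp(-0.5 * ((lon / sigma_long) ** 2 + (lat / sigma_lat) ** 2))


def max_field(pose: Pose, points: Sequence[Point],
              sigma_long: float = 2.0, sigma_lat: float = 1.0) -> float:
    """Largest field value over a set of points (CCPs or road-edge points)."""
    return max(gaussian_field(x, y, pose, sigma_long, sigma_lat)
               for x, y in points)


ego = (0.0, 0.0, 0.0)
near = collision_check_points((3.0, 0.0, 0.0), length=4.0, width=2.0)
far = collision_check_points((30.0, 0.0, 0.0), length=4.0, width=2.0)
near_score = max_field(ego, near)
far_score = max_field(ego, far)
```

The same `max_field` helper covers the obstacle factor $f_{\mathrm{O}}$ when the points are road-edge coordinates instead of another agent's collision checking points.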

Inference

Inference on the factor graph is equivalent to solving a non-linear, non-convex quadratic optimization problem defined over $\mathbf{s}_{1:N}^{1:F}$. For efficiency reasons, we developed a two-step approach. First, we use the Gauss-Newton method to solve a partial model that only consists of the $f_{\mathrm{M}}$, $f_{\mathrm{L}}$ and $f_{\mathrm{A}}$ factors, as these can all be evaluated in parallel across agents using $N$ individual trajectory models. This step produces smoothed trajectories, which are then frozen. Second, we sample joint trajectories for agents according to their probability (unnormalized energy), and use the $f_{\mathrm{O}}$ and $f_{\mathrm{C}}$ factors to score their quality. After repeating this $J$ times, the returned joint trajectory is sampled with a softmin operation over the scores of the $J$ samples, which favors the best-scoring ones.
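The first inference step can be illustrated on a toy problem. The sketch below applies Gauss-Newton to a single agent moving in 1D, with residuals modeled on $f_{\mathrm{M}}$, $f_{\mathrm{L}}$ and $f_{\mathrm{A}}$. It is not the paper's JAXopt-based implementation: the squared-residual objective, the finite-difference Jacobian, the hand-written linear solver, the use of the last anchor as the goal, and the step size `DT` are all assumptions made to keep the example self-contained.

```python
from typing import Callable, List

DT = 0.1  # assumed time between steps, in seconds
Residual = Callable[[List[float]], List[float]]


def residuals(z: List[float], anchors: List[float]) -> List[float]:
    """Stacked residuals for z = [x_1..x_F, v_1..v_F] in 1D."""
    F = len(anchors)
    x, v = z[:F], z[F:]
    r = [x[t] - anchors[t] for t in range(F)]                    # f_M-like
    r += [x[t + 1] - (x[t] + v[t] * DT) for t in range(F - 1)]   # f_L-like
    r += [2.0 ** 0.5 * (v[t + 1] - v[t]) for t in range(F - 1)]  # f_A, weight 2
    return r


def cost(fn: Residual, z: List[float]) -> float:
    return sum(r * r for r in fn(z))


def jacobian(fn: Residual, z: List[float], eps: float = 1e-6):
    """Forward-difference Jacobian with rows indexed by residual."""
    r0 = fn(z)
    cols = []
    for n in range(len(z)):
        zp = list(z)
        zp[n] += eps
        cols.append([(a - b) / eps for a, b in zip(fn(zp), r0)])
    return [[cols[n][m] for n in range(len(z))] for m in range(len(r0))]


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for k in range(c, n + 1):
                M[i][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gauss_newton(fn: Residual, z0: List[float], iters: int = 5) -> List[float]:
    """Repeatedly solve the normal equations (J^T J) step = -J^T r."""
    z = list(z0)
    for _ in range(iters):
        r, J = fn(z), jacobian(fn, z)
        n = len(z)
        JtJ = [[sum(row[a] * row[b] for row in J) for b in range(n)]
               for a in range(n)]
        Jtr = [sum(row[a] * ri for row, ri in zip(J, r)) for a in range(n)]
        step = solve(JtJ, [-g for g in Jtr])
        z = [zi + si for zi, si in zip(z, step)]
    return z


anchors = [0.0, 0.1, 0.35, 0.3, 0.4]  # noisy toy anchor positions
fn = lambda z: residuals(z, anchors)
z0 = anchors + [0.0] * len(anchors)   # start at the anchors with zero velocity
z_opt = gauss_newton(fn, z0)
print(cost(fn, z0), cost(fn, z_opt))
```

Since these three residuals are linear in the state, Gauss-Newton essentially converges in one iteration here; the obstacle and collision factors are the non-linear terms, and the text above handles them in the separate scoring step.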

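| WAYMO LEADERBOARD | REALISM | LINEAR SPEED | LINEAR ACCEL. | ANG. SPEED | ANG. ACCEL. | DIST. TO OBJ. | COLLISION | TTC | DIST. TO ROAD | OFFROAD | minADE ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SMART | 0.7511 | 0.3646 | 0.4057 | 0.4231 | 0.5844 | 0.3769 | 0.9655 | 0.8317 | 0.6590 | 0.9362 | 1.5447 |
| MVTE | 0.7301 | 0.3506 | 0.3530 | 0.4974 | 0.5999 | 0.3742 | 0.9049 | 0.8309 | 0.6655 | 0.9071 | 1.6769 |
| MPS (Ours) | 0.7416 | 0.3137 | 0.3049 | 0.4705 | 0.5834 | 0.3593 | 0.9629 | 0.8070 | 0.6651 | 0.9366 | 1.4841 |

Table 1: WOSAC Leaderboard: SMART (2024 winner) vs. MVTE (2023 winner) vs. MPS (ours). REALISM is the meta metric; the kinematic metrics are LINEAR SPEED, LINEAR ACCEL., ANG. SPEED and ANG. ACCEL.; the interactive metrics are DIST. TO OBJ., COLLISION and TTC; the map metrics are DIST. TO ROAD and OFFROAD.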

3 Experimental Evaluation

Benchmark. We evaluated MPS on the 2024 Waymo Open Sim Agents Challenge (Montali et al., 2023), where the task is to simulate 32 realistic rollouts of all agents in the scene, 8s into the future, given their 1s history. The simulation needs to be closed-loop and factorized between the ADV and other agents, which MPS satisfies naturally.

Implementation Details. We implemented the factors and the inference in JAX (https://github.com/google/jax), using JAXopt (https://jaxopt.github.io/) for the Gauss-Newton method. We leveraged JAX's just-in-time (JIT) compilation and observed great scalability. To speed up the simulation, we take the next 10 steps at each MPS iteration.

We trained our own MTR model $\pi$ using the open source code (https://github.com/sshaoshuai/MTR). We removed local attention and reduced the source polylines to 512. The training data is augmented by adding extra interacting agents and by applying random history dropouts. We followed the original training setup except for the number of epochs (50), the batch size (8) and the LR schedule ([25, 30, 35, 40, 45]). Training took about 3 days on 16 A100s.

We used only the official Waymo Open Motion Dataset v1.2.1 and did not use any Lidar or Camera data. We did not need any additional training and we did not use ensembles.

Sim Agents 2024 Results. We ranked fourth among all methods (Table 1). We outperformed the 2023 winner MVTE (Wang et al., 2023), which also uses MTR (Shi et al., 2024), and are approximately 1 point behind the 2024 winner SMART (Wu et al., 2024) in REALISM (0.7416 vs. 0.7511). MPS achieved near-top performance in a few safety-critical metrics such as COLLISION and OFFROAD, showing the effectiveness of the priors in our model. MPS underperformed in LINEAR SPEED and LINEAR ACCEL. We speculate this is because MPS can generate diverse rollouts that are very different from the logged data used for metric evaluation.

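| METHOD | REALISM | LINEAR SPEED | LINEAR ACCEL. | ANG. SPEED | ANG. ACCEL. | DIST. TO OBJ. | COLLISION | TTC | DIST. TO ROAD | OFFROAD | minADE ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MTR+RAND | 0.7019 | 0.3922 | 0.3530 | 0.3899 | 0.3304 | 0.3691 | 0.8491 | 0.8164 | 0.6706 | 0.9207 | 1.3084 |
| MPS | 0.7418 | 0.3158 | 0.3056 | 0.4664 | 0.5818 | 0.3604 | 0.9617 | 0.8094 | 0.6651 | 0.9374 | 1.4841 |

Table 2: Ablation study comparing MPS to MTR with random trajectory sampling. REALISM is the meta metric; the kinematic metrics are LINEAR SPEED, LINEAR ACCEL., ANG. SPEED and ANG. ACCEL.; the interactive metrics are DIST. TO OBJ., COLLISION and TTC; the map metrics are DIST. TO ROAD and OFFROAD.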

Ablation Study. To evaluate the value of the PGM priors, we compare MPS to the same MTR model with random trajectory sampling (MTR+RAND) on the validation dataset. As shown in Table 2, MPS improved safety-critical metrics such as COLLISION and OFFROAD, as well as the overall REALISM score, but underperformed on LINEAR SPEED and LINEAR ACCEL. for the same reason discussed above.
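The size of these differences can be read off Table 2 directly. The short snippet below only subtracts the reported MTR+RAND scores from the reported MPS scores for the metrics named above; the dictionary names are illustrative.

```python
# Scores copied from Table 2 (validation set).
mtr_rand = {"REALISM": 0.7019, "LINEAR SPEED": 0.3922, "LINEAR ACCEL.": 0.3530,
            "COLLISION": 0.8491, "OFFROAD": 0.9207}
mps = {"REALISM": 0.7418, "LINEAR SPEED": 0.3158, "LINEAR ACCEL.": 0.3056,
       "COLLISION": 0.9617, "OFFROAD": 0.9374}

# Positive values mean MPS scores higher than MTR+RAND on that metric.
delta = {name: round(mps[name] - mtr_rand[name], 4) for name in mps}
for name, d in delta.items():
    print(f"{name:14s} {d:+.4f}")
```

The largest gain is in COLLISION, and the two linear kinematic metrics are the ones that decrease.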

Qualitative Study. Qualitatively, MPS generates diverse (multi-modal) predictions (Fig. 3), and each prediction contains realistic traffic patterns such as lane merging, unprotected left turns, and yielding, among others (Fig. 4).

Figure 3: MPS creates diverse (multi-modal) rollouts.
Figure 4: MPS creates realistic traffic patterns.

4 Conclusion

We explored an approach that can improve on any trajectory simulation model by adding domain-specific priors and performing inference in the corresponding PGM. We believe combining prior-driven (top-down) and data-driven (bottom-up) methods is key to building robust and reliable autonomous driving planning and simulation.

We thank Joseph Ortiz and Wolfgang Lehrach for many useful discussions and suggestions.


References

  • Ettinger et al. (2021) S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. R. Qi, Y. Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The Waymo open motion dataset. In ICCV, 2021.
  • Luo et al. (2023) W. Luo, C. Park, A. Cornman, B. Sapp, and D. Anguelov. JFP: Joint future prediction with interactive multi-agent modeling for autonomous driving. In CoRL, 2023.
  • Montali et al. (2023) N. Montali, J. Lambert, P. Mougin, A. Kuefler, N. Rhinehart, M. Li, C. Gulino, T. Emrich, Z. Yang, S. Whiteson, B. White, and D. Anguelov. The Waymo open sim agents challenge. In NIPS Datasets Track, May 2023.
  • Nayakanti et al. (2023) N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. In ICRA. IEEE, 2023.
  • Ngiam et al. (2021) J. Ngiam, B. Caine, V. Vasudevan, Z. Zhang, H.-T. L. Chiang, J. Ling, R. Roelofs, A. Bewley, C. Liu, A. Venugopal, et al. Scene transformer: A unified architecture for predicting multiple agent trajectories. arXiv preprint arXiv:2106.08417, 2021.
  • Patwardhan et al. (2022) A. Patwardhan, R. Murai, and A. J. Davison. Distributing collaborative multi-robot planning with Gaussian belief propagation. IEEE Robotics and Automation Letters, Mar. 2022.
  • Philion et al. (2023) J. Philion, X. B. Peng, and S. Fidler. Trajeglish: Learning the language of driving scenarios. arXiv preprint arXiv:2312.04535, 2023.
  • Schwenzer et al. (2021) M. Schwenzer, M. Ay, T. Bergs, and D. Abel. Review on model predictive control: An engineering perspective. The International Journal of Advanced Manufacturing Technology, 117(5):1327–1349, 2021.
  • Seff et al. (2023) A. Seff, B. Cera, D. Chen, M. Ng, A. Zhou, N. Nayakanti, K. S. Refaat, R. Al-Rfou, and B. Sapp. MotionLM: Multi-agent motion forecasting as language modeling. In ICCV, 2023.
  • Shi et al. (2024) S. Shi, L. Jiang, D. Dai, and B. Schiele. MTR++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • Wang et al. (2023) Y. Wang, T. Zhao, and F. Yi. Multiverse transformer: 1st place solution for Waymo open sim agents challenge 2023. arXiv preprint arXiv:2306.11868, 2023.
  • Wu et al. (2024) W. Wu, X. Feng, Z. Gao, and Y. Kan. SMART: Scalable multi-agent real-time simulation via next-token prediction. arXiv preprint arXiv:2405.15677, 2024.
  • Zhou et al. (2023) Z. Zhou, J. Wang, Y.-H. Li, and Y.-K. Huang. Query-centric trajectory prediction. In CVPR, 2023.

Appendix

| | mAP | minADE | minFDE | MissRate |
|---|---|---|---|---|
| Vehicle | 0.4745 | 0.7526 | 1.5107 | 0.1489 |
| Pedestrian | 0.4827 | 0.3455 | 0.7251 | 0.0756 |
| Cyclist | 0.3898 | 0.7095 | 1.4299 | 0.1865 |
| Avg | 0.4490 | 0.6025 | 1.2219 | 0.1370 |

Table 3: Evaluation of our MTR model on the WOMD Motion Prediction dataset (validation).
Figure 5: Three simulated scenarios (top to bottom) at different timesteps (left to right) showcasing multi-modal behavior of agents. In the top and bottom simulations, the dark green car takes the left turn. However, in the middle simulation, it turns right. The green car, in the top and middle simulations, attempts the lane change to the left as the cars in front wait at the signal. In the middle simulation, the same green car comes to a stop in the same lane behind the traffic.
Figure 6: Two simulated scenarios (top to bottom) at different timesteps (left to right). In the top simulation, the light green car, attempting to take the right turn, stops and respects the teal car's right of way. In the bottom simulation, by contrast, the light green car quickly takes the right turn.
Figure 7: Simulated scenario at different timesteps (left to right). The dark green car, about to take the free right, waits for the pedestrian to cross the road.
Figure 8: Simulated scenario at different timesteps (left to right). The dark green car merges into the left lane as the right lane comes to an end.
Figure 9: Simulated scenario at different timesteps (left to right). The cars come to a stop at the signal. Additionally, the teal car starts to slow down and stops as the golden car in front is waiting at the signal.
Figure 10: Simulated scenario at different timesteps (left to right). The red car waits for the purple car to pass before taking an unprotected left turn.
Figure 11: Two simulated scenarios (top to bottom) at different timesteps (left to right). The top simulation shows the brown car entering the parking lot and driving straight. In the bottom simulation, the brown car attempts to park beside the pink car. We can also observe the dark purple car waiting for the teal car to complete the U-turn before taking the free right.