SemanticFormer: Holistic and Semantic Traffic Scene Representation for Trajectory Prediction using Knowledge Graphs

Zhigang Sun1, Zixu Wang2,3, Lavdim Halilaj3, Juergen Luettin3
1 Zhigang Sun is with Bosch Center for Artificial Intelligence (corresponding author: Zhigang Sun), [email protected], [email protected]
2 Zixu Wang is with the Technical University of Munich (TUM), Germany, [email protected]
3 Zixu Wang, Lavdim Halilaj, and Juergen Luettin are with Robert Bosch GmbH, {firstname.lastname}@bosch.com

Abstract

Trajectory prediction in autonomous driving relies on accurate representation of all relevant contexts of the driving scene, including traffic participants, road topology, traffic signs, as well as their semantic relations to each other. Despite increased attention to this issue, most approaches in trajectory prediction do not consider all of these factors sufficiently. We present SemanticFormer, a hybrid approach for predicting multimodal trajectories by reasoning over a semantic traffic scene graph. It utilizes high-level information in the form of meta-paths, i.e., trajectories on which an agent is allowed to drive, which are extracted from a knowledge graph and then processed by a novel pipeline based on multiple attention mechanisms to predict accurate trajectories. SemanticFormer comprises a hierarchical heterogeneous graph encoder to capture spatio-temporal and relational information across agents as well as between agents and road elements. Further, it includes a predictor to fuse different encodings and decode trajectories with probabilities. Finally, a refinement module assesses permitted meta-paths of trajectories and speed profiles to obtain final predicted trajectories. Evaluation on the nuScenes benchmark demonstrates improved performance compared to several SOTA methods. In addition, we demonstrate that our knowledge graph can be easily added to two existing graph-based SOTA methods, namely VectorNet and LaFormer, replacing their original homogeneous graphs. The evaluation results suggest that adding our knowledge graph improves the performance of the original methods by 5% and 4%, respectively. Graph data is available at https://github.com/boschresearch/nuScenes_Knowledge_Graph

I INTRODUCTION

Autonomous vehicles are recognized as a promising solution to address critical challenges such as road safety, traffic congestion, and energy optimization. A crucial task towards realizing the vision of autonomous driving is motion prediction [1]. It involves determining a set of spatial coordinates that represent the predicted movement of a given agent within a future time window. However, motion prediction is a challenging task due to various contextual factors such as the difficulty of intention prediction, the complex interactions of traffic participants, the intricate road topology, comprising lanes, lane dividers, and pedestrian crossings, as well as adherence to traffic regulations. State-of-the-art approaches utilize various representations for traffic scenes, such as raster-based [2, 3] or graph-based [4, 5] representations, to capture and utilize contextual information sufficiently.

Figure 1: Driving scenes represented in a heterogeneous graph capturing all relevant map details, traffic agents, and their semantic relationships.

Recent work applies a knowledge graph (KG) to encode diverse contextual information from traffic scenes [6]. Figure 1 illustrates various types of elements comprised in a typical traffic scene including different entities and their relations along with their semantic descriptions. We propose a novel approach that leverages heterogeneous information of static and dynamic elements modeled in the KG. It contains an attention mechanism for consuming semantic relationships and dependencies between traffic agents and road elements for accurate multimodal trajectory prediction. Main contributions:

  • A knowledge graph based approach to encode all relevant static and dynamic elements of a traffic scene with their semantic relationships.

  • A hybrid architecture with attention mechanisms to model the semantic relationships and dependencies between traffic agents and road elements for accurate multimodal trajectory prediction. The approach is evaluated on the nuScenes dataset [7].

  • Dedicated experiments to demonstrate the ease of incorporating our KG into existing graph-based trajectory prediction models. Concretely, we integrate the KG into VectorNet [5] and LaFormer [8] (changing the GIG block). The evaluation results show that incorporating the KG into VectorNet and LaFormer improves their ADE performance by 5% and 4%, respectively.

Figure 2: SemanticFormer Overview: Data Representation models the static map information and dynamic agents interaction by a holistic knowledge graph. Scene Graph Encoder extracts meta-paths and generates holistic latent representation for agents and lanes. Probability Predictor fuses the encodings and outputs trajectory candidates. Prediction Refinement uses anchor paths and speed profiles to evaluate trajectories and generates final predictions.

II RELATED WORK

Representation. Early methods for trajectory prediction use raster-based bird's-eye-view representations of the map and agents, encoding them with a number of channels for different information sources [9, 10]. These methods are extended to predict multiple trajectories with associated probabilities [2, 3]. Others aim to estimate probability distribution heat maps representing locations where agents could be located at a fixed time horizon [11, 12]. However, these models usually do not have access to high-level information and need to learn complex relationships from raw pixels.

Graph-based approaches represent scenes as vectors, polylines and graphs and thus operate at a higher level of abstraction [5, 13, 14, 15, 16, 8]. VectorNet [5] encodes both map features and agent trajectories as polylines and then merges them with a global interaction graph. TNT [17] extends VectorNet and combines it with multiple target reference trajectory proposals sampled from the lanes to diversify the prediction points. Unfortunately, these techniques usually use homogeneous graphs with one entity type and one relation type which prevents them from representing the rich heterogeneous traffic scene along with their complex relations.

Methods that use heterogeneous graphs, i.e., graphs with different entity types such as vehicles, bicycles or pedestrians and relation types like agent-to-lane or agent-to-agent, have recently been proposed [18, 19, 20, 21, 22, 23]. However, they are limited to only a portion of the relevant information and are unable to fully capture all scene details and the interconnections between the entities. Our approach aims to fill this gap using formal ontologies for constructing a knowledge graph [24, 25, 26, 27, 28] capturing the rich information of traffic scenes. Knowledge graphs have been applied in other automotive applications like POI recommendation [29, 30] and driving situation understanding [31].

Encoding. Early encodings are based on CNNs [32, 33, 2, 10], while more recent works use GNNs [14, 5, 13, 16, 15]. Attention mechanisms have recently attracted high interest in modeling the interactive behavior between agents for raster-based approaches [34, 35, 36, 37, 38], graph-based approaches [39, 40, 41, 42, 43, 44] and map-free approaches [45]. A hierarchical vector transformer-based approach, HiVT, is presented in [46]; it consists of a local context feature encoding followed by global message passing among agent-centric local regions. Autoregressive trajectory prediction approaches generating trajectories at intervals to produce scene-consistent multi-agent trajectories are proposed in [47, 34, 48, 49, 37]. Based on language modeling concepts with transformers, MotionLM [50] treats continuous trajectories as sequences of discrete motion tokens and casts multi-agent motion prediction as a language modeling task. In [51], a pretrained language model is used to encode text describing traffic situations combined with raster-based encodings. A game-theoretic modeling and learning approach considering relations between scene elements, alongside a novel hierarchical transformer decoder architecture, is presented in [52]. We also use a transformer-based architecture but encode different information sources including map topology, meta-paths, as well as relational information.

Figure 3: Illustration of traffic scene ontologies [6]: Agent Ontology defines agent attributes like category, speed, position, and trajectory, and relationships to map like distance to lane, and path distance. Map Ontology defines map elements like lane snippet, lane slice, traffic light, etc., and relations within map elements like left/right lane, switch via double dashed line.

Predicting. Goal- or intention-conditioned systems sample goal candidates and predict trajectories conditioned on them [53, 54, 44, 17, 55]. Grid-based policy learning via maximum entropy inverse reinforcement learning is used in [56] to condition trajectory forecasts. The authors of [57] use key-frames as representative states to trace out the general direction of the trajectory. Approaches considering lane-aware scene constraints that align motion dynamics with scene information are shown in [8, 58]. Our architecture is related, but we use a heterogeneous graph transformer to process the heterogeneous information of the KG. Others use anchors, i.e., fixed sets of anchor trajectories corresponding to permitted trajectories, to guide trajectory prediction [32, 3, 22, 59]. A method to learn latent representations of anchor trajectories is presented in [15]. Query-centric trajectory prediction is proposed in [60, 61], where agents' decisions are formulated as information queries using the available information before they make a decision. Our approach is related but refines anchors into meta-paths by using contextual information.

III METHODOLOGY

We aim to represent all relevant information that governs the behavior of traffic participants. For example, information about lane dividers (e.g., dashed line, solid line) conveys which lane changes are permitted and is therefore important for trajectory prediction; a pedestrian crossing together with the distance and direction of nearby pedestrians governs the behavior of oncoming vehicles. As seen below, it is not only important to represent all relevant information but also their relational information. We address this challenge by representing the map and agents with a knowledge graph. This enables us to explicitly model the various map elements and agents as well as their semantic relations. It also allows for the modeling of diverse traffic agent types like cars and bicycles, and their relations in driving situations such as whether two agents might interact, or drive behind or next to one another.

In the following, we describe a comprehensive architecture depicted in Figure 2, which uses a knowledge graph for predicting multimodal trajectories. The architecture begins by taking the scene graph $g_i$ as input and outputs multimodal trajectories for the target agent. Finally, the refinement module filters the predicted trajectories, considering anchor paths and speed profiles to avoid failure cases.

III-A Ontology and Heterogeneous Scene Graph

We utilize ontologies to explicitly represent the abundance of information from traffic scenes. Thus, based on the domain knowledge we model relationships between entities considered important for the trajectory prediction task. Figure 3 illustrates the developed ontologies, encompassing various entity and relation types. The entity types are categorized into two groups: the first one contains static entities like lane types, boundaries, center lines and stop areas; the second group contains dynamic entities like agents, their states, and bounding boxes. As for relation types, they fall into three groups: 1) between agents, which construct semantic associations such as lateral, longitudinal, and intersecting, as shown in Figure 4(b) akin to the concepts presented in [62]; 2) between map elements, establishing lane connectivity and relationships between lanes and road infrastructure elements like stop areas, traffic lights, pedestrian crossings; and 3) relations between map elements and agents, utilizing probability projection to map agents onto road infrastructure. Based on the designed ontology, we represent the scene by a heterogeneous scene graph $G = (V, E, \tau, \phi)$. It has nodes $v \in V$, their types $\tau(v)$, and edges $(u, v) \in E$, with edge types $\phi(u, v)$. The edges are directed since they are based on properties of the knowledge graph.
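To make the definition of $G = (V, E, \tau, \phi)$ concrete, the following is a minimal Python sketch of a directed heterogeneous graph. It is an illustration only, not the data structure used in this work: the class name `HeteroSceneGraph`, the node identifiers, and the type labels are hypothetical, and the sketch assumes at most one edge per ordered node pair.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class HeteroSceneGraph:
    """Directed heterogeneous graph G = (V, E, tau, phi) (illustrative only)."""
    node_type: dict = field(default_factory=dict)   # tau: node id -> node type
    edge_type: dict = field(default_factory=dict)   # phi: (u, v) -> relation type
    out_edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, node_id, ntype):
        self.node_type[node_id] = ntype

    def add_edge(self, u, v, relation):
        # Edges are directed: adding (u, v) does not add (v, u).
        if u not in self.node_type or v not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.edge_type[(u, v)] = relation
        self.out_edges[u].append(v)

    def successors(self, u, relation=None):
        """Nodes reachable from u in one step, optionally filtered by relation."""
        return [v for v in self.out_edges[u]
                if relation is None or self.edge_type[(u, v)] == relation]


# Hypothetical toy scene: one car on a lane snippet with one adjacent snippet.
g = HeteroSceneGraph()
g.add_node("car_1", "Participant")
g.add_node("snippet_a", "LaneSnippet")
g.add_node("snippet_b", "LaneSnippet")
g.add_edge("car_1", "snippet_a", "isOn")
g.add_edge("snippet_a", "snippet_b", "switchViaX")
```

Storing $\tau$ and $\phi$ as plain dictionaries keeps type lookups explicit, which is what later steps such as meta-path matching rely on.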

III-B Problem Formulation for Trajectory Prediction

We assume that the perception part can provide detailed information about agent positions and past motion as well as the HD map, so we build the scene graph as described in the previous section. Then, a sample of the dataset can be formed as $(g_i, y_i)$, where $g_i$ is a sample scene graph with trajectory information, local map, and target identifier, and $y_i$ is the ground truth future trajectory of the given target. Both agent past trajectories and map information are represented hierarchically. Further, $g_i \in G$ covers the information within a chosen time horizon $\{-t_h+1, \cdots, 0, 1, \cdots, t_f\}$.
We use $\mathbf{P}_{-t_h+1:0}^{i} = \{sp_{-t_h+2}^{i}, sp_{-t_h+3}^{i}, \ldots, sp_{0}^{i}\}$ to represent the respective scene participant nodes. Each participant node $sp_t^i$ is modeled as $sp_t^i = [d_{t,s}^i, d_{t,e}^i, a^i]$, where $d_{t,s}^i$ and $d_{t,e}^i$ denote the participant locations at the previous and current time stamps, whereas $a^i$ represents additional attributes like velocity, acceleration, heading change rate and the object type. For map information we use $\mathbf{S}_{1:N}^{i} = \{s_1^i, s_2^i, \ldots, s_N^i\}$ to denote a lane snippet, $s_n^i$ for lane slices and $N$ for the length of the given lane snippet.
Each lane slice vector $s_n^i = [d_{n,s}^i, d_{n,e}^i, a_i, d_{n,\mathrm{pre}}^i]$ adds $d_{n,\mathrm{pre}}^i$ to indicate the predecessor of the starting point. Connections between lane snippets are built by lane connectors $\mathbf{C}_{1:N}^{i} = \{c_1^i, c_2^i, \ldots, c_N^i\}$, where each $c_n^i$ encodes an ordered pose inside the lane connector of length $N$.
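The participant vectors $sp_t^i = [d_{t,s}^i, d_{t,e}^i, a^i]$ pair each location with its predecessor, which is why a history starting at $-t_h+1$ yields nodes indexed from $-t_h+2$. The sketch below illustrates this under simplifying assumptions: the function name `participant_vectors` is hypothetical, locations are 2D tuples, and the attribute list is treated as constant over time.

```python
def participant_vectors(positions, attributes):
    """Turn T past positions into T - 1 vectors [d_s, d_e, a].

    positions:  list of (x, y) tuples ordered from t = -t_h + 1 to t = 0
    attributes: list of extra features appended to every vector (assumed
                constant per participant in this sketch)
    """
    vectors = []
    # Pair every position with the one that follows it.
    for prev, curr in zip(positions, positions[1:]):
        vectors.append([*prev, *curr, *attributes])
    return vectors


# Three observed positions give two participant vectors.
history = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.5)]
sp = participant_vectors(history, attributes=[1.5])
```

Lane slice vectors could be built the same way from consecutive centerline points, with the predecessor entry appended as an additional feature.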

Coordinates in the knowledge graph are initially in a global coordinate system. These are then transformed separately into local, scene graph-specific coordinates, with the origin at the location of the target agent and the positive y-axis pointing along the facing direction of the target.
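As a rough sketch of this transformation, the function below maps a global 2D point into a frame whose origin is the target agent and whose positive y-axis follows the agent's heading. The name `to_local` and the heading convention (radians, counterclockwise from the global x-axis) are assumptions made for illustration.

```python
import math


def to_local(point, origin, heading):
    """Map a global (x, y) point into the target-centric frame.

    origin:  global position of the target agent
    heading: facing direction in radians, counterclockwise from the global
             x-axis (an assumed convention)
    The local +y axis points along the heading, +x points to the right.
    """
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    s, c = math.sin(heading), math.cos(heading)
    local_x = dx * s - dy * c   # projection onto the rightward axis
    local_y = dx * c + dy * s   # projection onto the heading direction
    return (local_x, local_y)


# Agent at (10, 5) facing the global +x direction.
ahead = to_local((12.0, 5.0), origin=(10.0, 5.0), heading=0.0)
left = to_local((10.0, 6.0), origin=(10.0, 5.0), heading=0.0)
```

A point two meters in front of the agent lands on the positive local y-axis, and a point to the agent's left gets a negative local x-coordinate.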

III-C Semantic Scene Graph Hierarchical Modeling

III-C1 Meta-Path Generation

We extract meta-paths to describe permitted and possible driving directions to navigate the target participant. Meta-paths related to the permitted lane changes and turns can be divided into three groups: 1) lane-changing; 2) entering the lane connector; and 3) leaving the lane connector.

(a) Meta-path Generation
(b) Agent-Agent Interaction
Figure 4: (a) Illustration of meta-paths depicting permitted trajectories. (b) Illustration of the participant interaction graph, characterized by edge types: Longitudinal (green), Intersecting (gray), Lateral (red), and Pedestrian (yellow).

Figure 4(a) gives a qualitative illustration of the generated meta-paths. Specifically, we list sample meta-paths below for the lane-changing (Equation 1), leaving connector (Equation 2), and entering connector (Equation 3) cases, where $\Phi$ represents the meta-path.

$$\Phi_0 = \mathbf{P} \stackrel{isOn}{\longrightarrow} \mathbf{S} \stackrel{switchViaX}{\longrightarrow} \mathbf{S} \stackrel{switchViaX}{\longrightarrow} \mathbf{S} \tag{1}$$
$$\Phi_1 = \mathbf{P} \stackrel{isOn}{\longrightarrow} \mathbf{C} \stackrel{CconnectS}{\longrightarrow} \mathbf{S} \stackrel{switchViaX}{\longrightarrow} \mathbf{S} \tag{2}$$
$$\Phi_2 = \mathbf{P} \stackrel{isOn}{\longrightarrow} \mathbf{S} \stackrel{switchViaX}{\longrightarrow} \mathbf{S} \stackrel{SconnectC}{\longrightarrow} \mathbf{C} \tag{3}$$
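One way to read these meta-paths is as typed path patterns that are matched against the scene graph. The sketch below enumerates matching node sequences for a given start participant. It is a simplified illustration, not the extraction procedure of this work: the function name, the toy graph, and the rule that a node may not be revisited within one path are assumptions, and `switchViaX` is used as a literal relation label.

```python
def metapath_instances(node_type, edges, start, pattern):
    """Enumerate node sequences that match a meta-path pattern.

    node_type: dict node id -> type label ("P", "S" or "C")
    edges:     dict node id -> list of (relation, target node)
    pattern:   list of (relation, target type) steps
    """
    paths = [[start]]
    for relation, target_type in pattern:
        extended = []
        for path in paths:
            for rel, nxt in edges.get(path[-1], []):
                # Assumption: a path never revisits a node, so a lane change
                # cannot immediately switch back to the lane it came from.
                if rel == relation and node_type[nxt] == target_type and nxt not in path:
                    extended.append(path + [nxt])
        paths = extended
    return paths


node_type = {"p": "P", "s1": "S", "s2": "S", "s3": "S", "c1": "C"}
edges = {
    "p": [("isOn", "s1")],
    "s1": [("switchViaX", "s2")],
    "s2": [("switchViaX", "s1"), ("switchViaX", "s3"), ("SconnectC", "c1")],
}
PHI_0 = [("isOn", "S"), ("switchViaX", "S"), ("switchViaX", "S")]
PHI_1 = [("isOn", "C"), ("CconnectS", "S"), ("switchViaX", "S")]
PHI_2 = [("isOn", "S"), ("switchViaX", "S"), ("SconnectC", "C")]

lane_change = metapath_instances(node_type, edges, "p", PHI_0)
leave_conn = metapath_instances(node_type, edges, "p", PHI_1)
enter_conn = metapath_instances(node_type, edges, "p", PHI_2)
```

In this toy graph the participant is on a lane snippet, so the leaving-connector pattern has no instance, while the other two patterns each match exactly one path.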

III-C2 Agent Motion and Lane Encoder

This component is responsible for encoding spatio-temporal information. We process participants $\mathbf{P}^i$, lane snippets $\mathbf{S}_{1:N}^i$, and lane connectors $\mathbf{C}_{1:N}^i$ in a sequential manner using both a Graph Neural Network (GNN) and a Gated Recurrent Unit (GRU) layer. Their respective encodings are represented by $p_i$, $s_j$, and $c_z$. Further, inspired by LaneGCN [13], we merge the outcomes as shown in Figure 5. Equation 4 introduces lane information to the related agents, while Equations 5 and 6 add participant information to the related lanes and lane connectors.

$$p_i = p_i + \operatorname{CrossAtt}\left\{p_i, [s_j, c_z]\right\} \tag{4}$$
$$s_j = s_j + \operatorname{CrossAtt}\left\{s_j, p_i\right\} \tag{5}$$
$$c_z = c_z + \operatorname{CrossAtt}\left\{c_z, p_i\right\} \tag{6}$$

where $i \in \{1, \ldots, N_{\text{P}}\}$, $j \in \{1, \ldots, N_{\text{LS}}\}$, $z \in \{1, \ldots, N_{\text{LC}}\}$. Encodings are assigned to node attributes in scene graph $g_i$.
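The residual structure of Equations 4 to 6 can be sketched with a bare dot-product attention. The code below is a simplified stand-in and not the trained module: it uses a single head, omits all learned projections, and lets every query attend to all given context vectors instead of only the related ones. The function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, context):
    """Scaled dot-product attention of one query vector over context vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, ctx)) / scale for ctx in context]
    weights = softmax(scores)
    return [sum(w * ctx[d] for w, ctx in zip(weights, context))
            for d in range(len(query))]


def residual_update(x, context):
    """x <- x + CrossAtt{x, context}, the pattern shared by Equations 4 to 6."""
    attended = cross_attend(x, context)
    return [xi + ai for xi, ai in zip(x, attended)]


p = [1.0, 0.0]                       # one participant encoding
lanes = [[1.0, 0.0], [0.0, 1.0]]     # lane snippet encodings s_j
connectors = [[0.5, 0.5]]            # lane connector encodings c_z

p_new = residual_update(p, lanes + connectors)       # Equation 4
s_new = [residual_update(s, [p]) for s in lanes]     # Equation 5
c_new = [residual_update(c, [p]) for c in connectors]  # Equation 6
```

With a single participant as context, the attention weight is 1, so each lane encoding is simply shifted by the participant encoding. With several participants the update becomes a weighted mixture.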

Figure 5: Illustration of the agent motion and lane encoder: the GNN and GRU extract spatio-temporal information, and the attention mechanism models the relation between participants and their related lanes.

III-C3 Scene Graph Encoder

A heterogeneous graph operator is used to reason over the given scene graph $g_i$. To better incorporate the generated meta-paths, we follow the principle from HAN [63], i.e., using a hierarchical attention structure from node-level attention to semantic-level attention as shown in Figure 6.

Applying HAN to learn relational information is shown in Algorithm 1. Three distinct node types are used for the probability predictor to encode participants, lane snippets, and lane connectors. We use $p_i$, $s_j$, $c_z$ to represent these three types respectively, where $p_i \in Z$, $s_j \in Z$, $c_z \in Z$.

input : Heterogeneous scene graph $G = (V, E, \tau, \phi)$
        Node feature $\{h_i, \forall i \in V, h \in \{p, s, c\}\}$
        Meta-path set $\{\Phi_0, \Phi_1, \ldots, \Phi_P\}$
        Number of attention heads $K$
output : Heterogeneous graph node embedding $Z$
for $\Phi_i \in \{\Phi_0, \Phi_1, \ldots, \Phi_P\}$ do
    for $k = 1 \ldots K$ do
        Type-specific transformation $\mathrm{h}_i' \leftarrow \text{MLP}\{\mathrm{h}_i\}$
        for $i \in V$ do
            Find the meta-path based neighbors $N_i^{\Phi}$
            for $j \in N_i^{\Phi}$ do
                Calculate the weight coefficient $\alpha_{ij}^{\Phi}$
            end for
            Calculate the semantic-specific node embedding $\mathrm{z}_i^{\Phi} \leftarrow \sigma\left(\sum_{j \in N_i^{\Phi}} \alpha_{ij}^{\Phi} \cdot \mathbf{h}_j'\right)$
        end for
        Concatenate the learned embeddings from all attention heads $\mathrm{z}_i^{\Phi} \leftarrow \|_{k=1}^{K} \sigma\left(\sum_{j \in N_i^{\Phi}} \alpha_{ij}^{\Phi} \cdot \mathbf{h}_j'\right)$
    end for
    Calculate the weight of meta-path $\beta_{\Phi_i}$
    Fuse the semantic-specific embedding $Z \leftarrow \sum_{i=1}^{P} \beta_{\Phi_i} \cdot Z_{\Phi_i}$
end for
return $Z$
Algorithm 1 Semantic Graph Learning via HAN
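The two attention levels of Algorithm 1 can be sketched in a few lines of plain Python. This is a heavily simplified illustration rather than the trained encoder: it uses a single attention head, skips the type-specific MLP, picks tanh for $\sigma$, and takes the attention parameters `att_vec` and `q` as hand-set inputs instead of learned weights. All names and toy values are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def node_level(h, neighbors, att_vec):
    """Node-level attention for one meta-path: z_i = sigma(sum_j alpha_ij h_j)."""
    z = {}
    for i, nbrs in neighbors.items():
        scores = []
        for j in nbrs:
            e = sum(a * x for a, x in zip(att_vec, h[i] + h[j]))
            scores.append(e if e > 0 else 0.2 * e)   # LeakyReLU
        alpha = softmax(scores)
        z[i] = [math.tanh(sum(a * h[j][d] for a, j in zip(alpha, nbrs)))
                for d in range(len(h[i]))]
    return z


def semantic_level(z_per_path, q):
    """Semantic-level attention: weight each meta-path, then fuse."""
    nodes = list(z_per_path[0])
    w = [sum(sum(qd * math.tanh(zd) for qd, zd in zip(q, z[i])) for i in nodes)
         / len(nodes) for z in z_per_path]
    beta = softmax(w)
    fused = {i: [sum(b * z[i][d] for b, z in zip(beta, z_per_path))
                 for d in range(len(q))] for i in nodes}
    return fused, beta


h = {"p": [1.0, 0.0], "s1": [0.0, 1.0], "s2": [1.0, 1.0]}
# Meta-path based neighbors (each node includes itself).
nbrs_phi0 = {"p": ["p", "s1"], "s1": ["s1"], "s2": ["s2"]}
nbrs_phi1 = {"p": ["p", "s2"], "s1": ["s1", "s2"], "s2": ["s2"]}
att_vec, q = [0.1, 0.2, 0.3, 0.4], [0.5, -0.5]

z0 = node_level(h, nbrs_phi0, att_vec)
z1 = node_level(h, nbrs_phi1, att_vec)
Z, beta = semantic_level([z0, z1], q)
```

The node-level step aggregates features along each meta-path separately, and the semantic-level step decides how much each meta-path contributes to the final embedding.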

III-C4 Probability Predictor

As a result of the scene graph encoder, nodes of lane snippets $s_i$ and lane connectors $c_i$ are projected to the same dimension $Z$. We treat these two types of nodes as the same type and use $l_i$ to represent them. Inspired by LaFormer [8], we align the target agent motion and lane information at each future time step $t \in \{1, \ldots, t_f\}$. To achieve this, we use a lane score head and an attention mechanism to predict lane encoding probabilities. In the attention mechanism, key ($K$) and value ($V$) vectors are produced by $\mathrm{MLP}(p_i)$, whereas the query ($Q$) is produced by $\mathrm{MLP}(l_i)$. Next, attention encodings are calculated by $A_{i,j} = \operatorname{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$. The predicted score of the $j$th lane encoding at $t$ is shown in equation 7, where $\phi$ denotes MLP layers.
We select top-k lane encodings to maintain the uncertainty and concatenate the candidate lane segments and associated scores over the future time steps to obtain $L = \operatorname{ConCat}\left\{l_{1:k}, \hat{s}_{1:k}\right\}_{t=1}^{t_f}$.

$$\hat{s}_{j,t} = \frac{\exp\left(\phi\left\{p_i, l_j, A_{i,j}\right\}\right)}{\sum_{n=1}^{N_{\text{lane} \in \Phi_j}} \exp\left(\phi\left\{p_i, l_n, A_{i,n}\right\}\right)} \tag{7}$$
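The lane scoring step for a single future time step can be sketched as follows. This is a minimal illustration under stated assumptions: the MLPs are replaced by single linear maps (`Wq`, `Wk`, `Wv`, `w_phi`), the agent encoding `p` is assumed to be a short sequence of motion encodings that is mean-pooled before being passed to $\phi$, and all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lane_scores(p, lanes, Wq, Wk, Wv, w_phi):
    """Softmax lane scores in the spirit of equation 7.

    p:     (T, d) agent motion encodings (assumed to be a sequence)
    lanes: (N, d) lane encodings l_1..l_N
    """
    Q = lanes @ Wq                       # (N, d_k) queries come from the lanes
    K = p @ Wk                           # (T, d_k) keys come from the agent
    V = p @ Wv                           # (T, d_k) values come from the agent
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1) @ V   # (N, d_k)
    p_pooled = np.broadcast_to(p.mean(axis=0), lanes.shape)   # (N, d), assumption
    feats = np.concatenate([p_pooled, lanes, A], axis=1)      # {p_i, l_j, A_ij}
    return softmax(feats @ w_phi)        # normalize over all candidate lanes

def top_k_lanes(lanes, scores, k):
    """Concatenate the k best lane encodings with their scores."""
    idx = np.argsort(scores)[::-1][:k]
    return np.concatenate([lanes[idx], scores[idx, None]], axis=1)

rng = np.random.default_rng(0)
d = 4
p, lanes = rng.normal(size=(3, d)), rng.normal(size=(6, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
scores = lane_scores(p, lanes, Wq, Wk, Wv, rng.normal(size=3 * d))
print(top_k_lanes(lanes, scores, k=2).shape)   # (2, d + 1)
```

Repeating `top_k_lanes` for every future time step and stacking the results would give a tensor playing the role of $L$.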

To optimize the probability estimation, we use a binary cross-entropy loss $\mathcal{L}_{\text{lane}}$, as shown in equation 8. The ground truth lane segment $s_t$ is obtained from the isOn relationship in the knowledge graph. Next, a cross-attention operation is performed to further fuse agent and lane information: the key and value vectors are computed from $L$ and the query vector from $p_i$. The updated lane output is $l_{i,\mathrm{att}}$.

$$\mathcal{L}_{\text{lane}} = \sum_{t=1}^{t_f} \mathcal{L}_{\mathrm{CE}}\left(s_t, \hat{s}_t\right) \tag{8}$$
Figure 6: Illustration of node-level and semantic-level attention from the perspective of the traffic graph. All traffic participants receive guidance from the corresponding meta-paths.

Then we employ a predictor for generating multimodal trajectories. This is realized by sampling a latent vector $z$ from a multivariate normal distribution and adding it to the fusion encodings. Next, a Laplacian mixture density network (MDN) decoder is used to output a set of trajectories $\sum_{m=1}^{M} \hat{\pi}_m \operatorname{Laplace}(\mu, b)$. $\hat{\pi}_m$ denotes the probability of each mode and $\sum_{m=1}^{M} \hat{\pi}_m = 1$. $\mu$ and $b$ represent the location and scale parameters of each Laplace component. We use an MLP to predict $\hat{\pi}_m$, a GRU to recover the time dimension $t_f$ of the predictions, and two MLPs to predict $\mu$ and $b$. The predictor is trained by minimizing a regression loss and a classification loss. The regression loss is computed using the Winner-Takes-All strategy as shown in equation 9.

$$\mathcal{L}_{\mathrm{reg}} = \frac{1}{t_f} \sum_{t=1}^{t_f} -\log P\left(Y_t \mid \mu_t^{m^*}, b_t^{m^*}\right) \tag{9}$$

where $Y$ is the ground truth position and $m^*$ represents the best mode, i.e. the one with the minimum $L_2$ error among the $M$ predictions. A cross-entropy loss is used to optimize the mode classification as shown in equation 10.

$$\mathcal{L}_{\mathrm{cls}} = \sum_{m=1}^{M} -\pi_m \log\left(\hat{\pi}_m\right). \tag{10}$$
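Equations 9 and 10 can be sketched in NumPy as below. Several choices are assumptions rather than details given in the text: the best mode $m^*$ is picked by the summed per-step $L_2$ error, the x and y coordinates are treated as independent Laplace variables, and the target $\pi_m$ is taken to be one-hot on $m^*$. The function names are hypothetical.

```python
import numpy as np

def laplace_nll(y, mu, b):
    """Element-wise negative log-likelihood of y under Laplace(mu, b)."""
    return np.log(2.0 * b) + np.abs(y - mu) / b

def wta_losses(Y, mu, b, pi_hat):
    """Winner-Takes-All regression loss and mode classification loss.

    Y:      (T, 2) ground truth positions
    mu, b:  (M, T, 2) location and scale of each of the M modes
    pi_hat: (M,) predicted mode probabilities, summing to 1
    """
    # Best mode: smallest summed L2 error over the horizon (assumption).
    l2 = np.linalg.norm(mu - Y[None], axis=-1).sum(axis=1)
    m_star = int(np.argmin(l2))
    # Equation 9: average NLL over time steps, evaluated only for the best mode.
    reg = laplace_nll(Y, mu[m_star], b[m_star]).sum(axis=-1).mean()
    # Equation 10 with a one-hot target on m_star (assumption).
    cls = -np.log(pi_hat[m_star])
    return m_star, float(reg), float(cls)

Y = np.zeros((3, 2))
mu = np.stack([np.ones((3, 2)), np.zeros((3, 2))])   # mode 1 matches Y exactly
b = np.ones((2, 3, 2))
print(wta_losses(Y, mu, b, np.array([0.25, 0.75])))
```

Because only the winning mode receives a regression gradient, the remaining modes are free to cover other plausible futures, which is the usual motivation for the Winner-Takes-All strategy.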

We use several additional terms, such as a velocity loss and an angle loss, to measure the deviation from the ground truth and to investigate the influence of different measurements on the predictions. For the velocity loss, we calculate the ground truth velocity traces $V_t = \|Y_t - Y_{t-1}\|_2$ and the predicted velocity traces $\hat{V}_t = \|\mu_t - \mu_{t-1}\|_2$; the velocity loss is then given by equation 11.

$$\mathcal{L}_{\mathrm{velocity}} = \frac{1}{t_f} \sum_{t=1}^{t_f} -\log P\left(V_t \mid \hat{V}_t^{m^*}, b_t^{m^*}\right) \tag{11}$$
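A small sketch of the velocity term follows. It assumes that the position preceding the first future step is the current agent position (`start`), that the likelihood is Laplace as in equation 9, and that one scalar scale per time step is used for the velocity. These are assumptions, and the function names are hypothetical.

```python
import numpy as np

def velocity_trace(traj, start):
    """Per-step displacement norms ||x_t - x_{t-1}||_2 for t = 1..T.

    traj:  (T, 2) future positions
    start: (2,) position at t = 0 (assumed to be the current agent position)
    """
    pts = np.vstack([start[None], traj])
    return np.linalg.norm(np.diff(pts, axis=0), axis=1)

def velocity_loss(Y, mu_best, b_best, start):
    """Equation 11 with a Laplace likelihood (assumption).

    b_best: (T,) scalar scale per time step for the best mode (assumption)
    """
    V = velocity_trace(Y, start)
    V_hat = velocity_trace(mu_best, start)
    return float(np.mean(np.log(2.0 * b_best) + np.abs(V - V_hat) / b_best))

Y = np.array([[1.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
print(velocity_trace(Y, np.zeros(2)))   # displacement per step
```

The traces are displacements per time step; dividing by the sampling interval would convert them into speeds in meters per second.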

For the angle loss, $X_0$ denotes the initial position, and we calculate the ground truth angle $\theta_t = \operatorname{arctan2}\left(Y_t - X_0\right)$ and the predicted angle $\hat{\theta}_t = \operatorname{arctan2}\left(\mu_t - X_0\right)$. Equation 12 shows the calculation of the loss:

$$\mathcal{L}_{\text{angle}} = \frac{1}{t_f} \sum_{t=1}^{t_f} -\cos\left(\hat{\theta}_t - \theta_t\right) \tag{12}$$
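Equation 12 can be illustrated with a few lines of NumPy. The sketch assumes that arctan2 is applied to the y and x components of the displacement from $X_0$; the function names are hypothetical.

```python
import numpy as np

def heading_angles(traj, x0):
    """Angle of each point relative to the initial position x0."""
    d = traj - x0
    return np.arctan2(d[:, 1], d[:, 0])

def angle_loss(Y, mu_best, x0):
    """Mean negative cosine of the angular difference (equation 12)."""
    diff = heading_angles(mu_best, x0) - heading_angles(Y, x0)
    return float(np.mean(-np.cos(diff)))

x0 = np.zeros(2)
Y = np.array([[1.0, 0.0], [2.0, 0.0]])
print(angle_loss(Y, Y, x0), angle_loss(Y, -Y, x0))
```

The loss ranges from -1 for perfectly aligned directions to 1 for opposite directions, and the cosine avoids any discontinuity at the wrap-around between $-\pi$ and $\pi$.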

The total loss for the motion prediction is given by equation 13.

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{lane}} + \lambda_2 \mathcal{L}_{\text{velocity}} + \lambda_3 \mathcal{L}_{\text{angle}} + \mathcal{L}_{\text{reg}} + \mathcal{L}_{\mathrm{cls}} \tag{13}$$

III-D Prediction Refinement

To filter out unreasonable predictions, we check the predicted trajectories against anchor paths [59]. Anchor paths provide the possible and permitted trajectories for an agent at a given position in the road network, and we discard trajectory candidates that lie far from them. Next, we cluster the remaining trajectory candidates w.r.t. their speed profiles and keep the top candidates closest to the cluster centers. We also perform experiments using the ground truth speed profile to gauge the relevance of the speed component in the prediction results; since the ground truth speed is not available at inference time, this is not a fair comparison with other methods. Details are shown in Algorithm 2.

input : Predictions $\left\{\mu_{1:t_f}^{1}, \mu_{1:t_f}^{2}, \ldots, \mu_{1:t_f}^{k}\right\}$
        Predicted Probabilities $\{\pi_1, \pi_2, \ldots, \pi_k\}$
        Anchor Paths $\{P_1, P_2, \ldots, P_5\}$
output : Filtered Predictions $\left\{\hat{Y}_{1:t_f}^{1}, \hat{Y}_{1:t_f}^{2}, \ldots, \hat{Y}_{1:t_f}^{5}\right\}$
if Ground Truth speed profile $s_{gt}$ available then
    Calculate the speed profiles $s_1$, $s_2$, …, $s_k$
    Calculate similarity to $s_{gt}$ using Dynamic Time Warping (DTW)
    Select 5 most similar predictions $\left\{\hat{Y}_{1:t_f}^{1}, \hat{Y}_{1:t_f}^{2}, \ldots, \hat{Y}_{1:t_f}^{5}\right\}$
else
    for $P_i \in \{P_1, P_2, \ldots, P_5\}$ do
        for $\mu_{1:t_f}^{j} \in \left\{\mu_{1:t_f}^{1}, \mu_{1:t_f}^{2}, \ldots, \mu_{1:t_f}^{k}\right\}$ do
            Calculate the distance $d_{ij}$ between $P_i$ and $\mu_{1:t_f}^{j}$
        end for
        For each $i$, select the $\min_5 d_{ij}$ and calculate the speed profiles $s_{i1}$, $s_{i2}$, $s_{i3}$, $s_{i4}$, $s_{i5}$.
        Cluster speed profiles $s_{ij}$ using K-means and output the prediction $\hat{Y}_{1:t_f}^{i}$ closest to the cluster centers.
    end for
end if
return $\left\{\hat{Y}_{1:t_f}^{1}, \hat{Y}_{1:t_f}^{2}, \ldots, \hat{Y}_{1:t_f}^{5}\right\} \subseteq \left\{\mu_{1:t_f}^{1}, \mu_{1:t_f}^{2}, \ldots, \mu_{1:t_f}^{k}\right\}$
Algorithm 2 Prediction Refinement
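The two branches of Algorithm 2 can be sketched as follows. This is a simplified illustration, not the authors' implementation: the distance between an anchor path and a prediction is assumed to be the mean point-wise Euclidean distance between equally long point sequences, and the K-means step is replaced by the mean of a single cluster. All function and argument names are hypothetical.

```python
import numpy as np

def speed_profile(traj, start):
    """Per-step displacement norms of a (T, 2) trajectory starting at `start`."""
    pts = np.vstack([start[None], traj])
    return np.linalg.norm(np.diff(pts, axis=0), axis=1)

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def select_by_gt_speed(preds, s_gt, start, keep=5):
    """Branch 1: keep the predictions whose speed profile is most similar to s_gt."""
    d = np.array([dtw_distance(speed_profile(p, start), s_gt) for p in preds])
    return np.argsort(d)[:keep]

def select_by_anchor(preds, anchors, start, keep=5):
    """Branch 2: one prediction per anchor path.

    For each anchor, take the `keep` nearest predictions and return the one
    whose speed profile is closest to their mean profile (a single-cluster
    stand-in for K-means, used here only to keep the sketch short).
    """
    chosen = []
    for P in anchors:
        d = np.array([np.linalg.norm(p - P, axis=1).mean() for p in preds])
        near = np.argsort(d)[:keep]
        profiles = np.stack([speed_profile(preds[j], start) for j in near])
        center = profiles.mean(axis=0)
        best = near[np.argmin(np.linalg.norm(profiles - center, axis=1))]
        chosen.append(int(best))
    return chosen

# Six straight-line candidates along x with speeds 1..6 (toy data).
T, start = 4, np.zeros(2)
steps = np.arange(1, T + 1)[:, None]
preds = np.stack([np.hstack([steps * v, np.zeros((T, 1))]) for v in range(1, 7)])
print(select_by_gt_speed(preds, np.full(T, 3.0), start, keep=3))
print(select_by_anchor(preds, [preds[0], preds[4]], start, keep=3))
```

In both branches the returned indices refer to the original prediction set, which matches the subset relation in the return statement of Algorithm 2.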

IV EXPERIMENTS

IV-A Dataset & nuScenes Knowledge Graph

The nuScenes dataset [7] is a popular dataset for self-driving cars that was gathered in Boston and Singapore. It encompasses 1000 scenes, each lasting 20 seconds, and includes meticulously annotated ground truth details along with high-definition (HD) maps. The vehicles within this dataset have manually annotated 3D bounding boxes that are published at a rate of 2 Hz. For the prediction task, the objective is to forecast the subsequent 6 seconds from the preceding 2 seconds of object history and the map data. We adhere to the standard split provided by the nuScenes benchmark description. Applying our proposed ontology to the nuScenes dataset, we generate the nuScenes Knowledge Graph including agent and map information as described in [6]. Features are provided by the upstream perception components and the HD map from the nuScenes dataset. Tables I and II list the feature sets used for each node type and each relation type. All features that express a category type are one-hot encoded.
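As a minimal illustration of the one-hot encoding of categorical features, the sketch below builds a feature vector for a Participant node from its type and size. The category list `PARTICIPANT_TYPES` and the function names are hypothetical and do not reflect the actual category set of the knowledge graph.

```python
import numpy as np

# Hypothetical category list, used only for illustration.
PARTICIPANT_TYPES = ["car", "truck", "pedestrian", "bicycle"]

def one_hot(value, categories):
    """Return a vector with a single 1 at the index of `value`."""
    vec = np.zeros(len(categories), dtype=np.float32)
    vec[categories.index(value)] = 1.0
    return vec

def participant_features(ptype, size):
    """Concatenate the one-hot type with the numeric size (length, width, height)."""
    return np.concatenate([one_hot(ptype, PARTICIPANT_TYPES),
                           np.asarray(size, dtype=np.float32)])

print(participant_features("pedestrian", (0.6, 0.7, 1.8)))
```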

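TABLE I: Node Type Features
| View | Node type | Features |
|---|---|---|
| Agent | SceneParticipant | Orientation, State, Position, Velocity, Acceleration, Heading Change, Distance to Centerline |
| Agent | Participant | Type, Size |
| Agent | Sequence | Timestamp |
| Agent | Scene | - |
| Map | LaneSnippet | Length |
| Map | LaneSlice | Width, Center Pose |
| Map | LaneConnector | - |
| Map | OrderedPose | Center Pose |
| Map | Lane | - |
| Map | CarparkArea | - |
| Map | Walkway | - |
| Map | Intersection | - |
| Map | PedCrossingStopArea | - |
| Map | StopSignArea | - |
| Map | TrafficLightStopArea | - |
| Map | TurnStopArea | - |
| Map | YieldStopArea | - |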
TABLE II: Relation Type Features
| View | Relation type | Features |
|---|---|---|
| Agent | hasSceneParticipant | - |
| Agent | inNextScene | Time Elapsed |
| Agent | hasNextScene | Time Elapsed |
| Agent | hasPreviousScene | Time Elapsed |
| Agent | isSceneParticipant | - |
| Map | switchViaDoubleDashedWhite | - |
| Map | switchViaRoadDivider | - |
| Map | switchViaSingleZigzagWhite | - |
| Map | switchViaDoubleSolidWhite | - |
| Map | switchViaSingleSolidYellow | - |
| Map | switchViaSingleSolidWhite | - |
| Map | isSlice/PoseOnStopArea | - |
| Map | connectsIncoming/Outgoing | - |
| Map | hasNextLane/Snippet/Slice | - |
| Interaction | isOnMapElement | Probability |
| Interaction | relatedLongitudinal | Path/Distance |
| Interaction | relatedLateral | Path/Distance |
| Interaction | relatedIntersecting | Path/Distance |
| Interaction | relatedPedestrian | Distance |

IV-B Metrics

We utilize standard evaluation metrics to assess the prediction performance, specifically $ADE_K$ (Average Displacement Error for $K$ modes) and $FDE_K$ (Final Displacement Error for $K$ modes). ADE averages the $L_2$ error across all predicted time steps, whereas FDE measures the $L_2$ error at the final step. In both cases, the minimum error among the $K$ modes is reported. Both ADE and FDE are measured in meters. Additionally, the miss rate $MR_K$ calculates the percentage of scenarios where the final-step error exceeds 2 meters.
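The metrics described above can be computed as in the following sketch. It follows the description in this section, so the miss rate is based on the best final-step error among the $K$ modes; treating the minimum over modes as the relevant error is an assumption, and the function names are hypothetical.

```python
import numpy as np

def min_ade_fde(preds, gt):
    """Return (minADE_K, minFDE_K) for one scenario.

    preds: (K, T, 2) predicted trajectories, gt: (T, 2) ground truth.
    The two minima may be attained by different modes.
    """
    err = np.linalg.norm(preds - gt[None], axis=-1)   # (K, T) per-step L2 error
    return float(err.mean(axis=1).min()), float(err[:, -1].min())

def miss_rate(all_preds, all_gts, threshold=2.0):
    """Fraction of scenarios whose best final-step error exceeds `threshold` meters."""
    misses = [min_ade_fde(p, g)[1] > threshold for p, g in zip(all_preds, all_gts)]
    return float(np.mean(misses))

gt = np.zeros((3, 2))
close = np.stack([np.full((3, 2), [0.5, 0.0]), np.full((3, 2), [3.0, 0.0])])
far = np.full((2, 3, 2), [3.0, 0.0])
print(min_ade_fde(close, gt))
print(miss_rate([close, far], [gt, gt]))
```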

IV-C Model Implementation

The hidden dimension of vectors in the pipeline is set to 32. The number of layers of the heterogeneous graph neural network is set to 1 and sum is used as the aggregation method. The number of attention heads in HAN is set to 8. The weights $\lambda_1$, $\lambda_2$ and $\lambda_3$ in equation 13 are set to 0.95, 1, and 1, respectively.

We use all agent and map elements within the four closest roadblocks. The coordinate system in the model is the bird's-eye view (BEV) centered at the agent location at $t=0$. We use the direction from the agent location at $t=-1$ to the agent location at $t=0$ as the positive x-axis. The model is trained on a Tesla V100 GPU with a batch size of 32 and the Adam optimizer, using an initial learning rate of $1 \times 10^{-3}$ that is decayed by 0.7 every 5 epochs.
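The agent-centric coordinate frame and the learning rate schedule can be sketched as below. The sketch assumes 2-D global coordinates and reads "decayed by 0.7" as multiplying the learning rate by 0.7 after every 5 epochs; both function names are hypothetical.

```python
import numpy as np

def to_agent_frame(points, pos_prev, pos_now):
    """Transform global (x, y) points into the agent-centric BEV frame.

    pos_now (agent at t = 0) becomes the origin, and the direction from
    pos_prev (agent at t = -1) to pos_now becomes the positive x-axis.
    For a stationary agent the heading is undefined and defaults to 0 here.
    """
    dx, dy = pos_now - pos_prev
    heading = np.arctan2(dy, dx)
    c, s = np.cos(heading), np.sin(heading)
    R = np.array([[c, s], [-s, c]])          # rotation by -heading
    return (points - pos_now) @ R.T

def step_decay_lr(epoch, base_lr=1e-3, gamma=0.7, step=5):
    """Learning rate multiplied by `gamma` every `step` epochs (assumed schedule)."""
    return base_lr * gamma ** (epoch // step)

# An agent moving north: a point 1 m ahead lands on the positive x-axis.
pts = to_agent_frame(np.array([[0.0, 2.0]]), np.array([0.0, 0.0]), np.array([0.0, 1.0]))
print(np.round(pts, 6), step_decay_lr(12))
```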

IV-D Quantitative Results

We compare our results on the nuScenes online benchmark as shown in Table III. SemanticFormer directly predicts 5 trajectories without prediction refinement, whereas its extension, SemanticFormerR, predicts 25 trajectories and then refines them. As can be observed, SemanticFormerR achieves competitive performance, indicating the benefit of leveraging the complex and heterogeneous scene information represented in the Knowledge Graph. The results also suggest that speed profiles strongly influence the predicted trajectories. When the ground truth speed is used in Algorithm 2, which is not a fair comparison, SemanticFormerR achieves considerably lower ADE and FDE than state-of-the-art methods.

TABLE III: Performance Table on nuScenes Benchmark
| Method | GT Speed | FDE (K=1) | ADE (K=5) | MR (K=5) | ADE (K=10) | MR (K=10) |
|---|---|---|---|---|---|---|
| CoverNet [3] | × | 11.36 | 1.96 | 0.67 | 1.48 | - |
| Trajectron++ [49] | × | 9.52 | 1.88 | 0.70 | 1.51 | 0.57 |
| LaPred [64] | × | 8.37 | 1.47 | 0.53 | 1.12 | 0.46 |
| P2T [56] | × | 10.50 | 1.45 | 0.64 | 1.16 | 0.46 |
| LaneGCN [13] | × | - | - | 0.49 | 0.95 | 0.36 |
| GOHOME [65] | × | 6.99 | 1.42 | 0.57 | 1.15 | 0.47 |
| Autobot [38] | × | 8.19 | 1.37 | 0.62 | 1.03 | 0.44 |
| THOMAS [12] | × | 6.71 | 1.33 | 0.55 | 1.04 | - |
| PGP [66] | × | 7.17 | 1.30 | 0.61 | 1.00 | 0.37 |
| LaFormer [8] | × | 6.95 | 1.19 | 0.48 | 0.93 | 0.33 |
| Socialea [61] | × | 6.77 | 1.18 | 0.48 | 1.02 | 0.44 |
| FRM [58] | × | 6.59 | 1.18 | 0.48 | 0.88 | 0.30 |
| SemanticFormer | × | 6.29 | 1.15 | 0.48 | 0.91 | 0.31 |
| SemanticFormerR | × | 6.27 | 1.14 | 0.50 | 0.87 | 0.30 |
| DMAP [59] | ✓ | - | 1.09 | 0.19 | 1.07 | 0.18 |
| SemanticFormerR | ✓ | 3.88 | 0.86 | 0.26 | 0.78 | 0.13 |

IV-E Ablation study

IV-E1 Effect of Topological Structure of Heterogeneous Graph

The knowledge graph provides explicit and logical relationships between different heterogeneous nodes. We study the performance improvement over a fully connected and a fully unconnected graph structure, as shown in Table IV.

TABLE IV: Ablation Study of Graph Topological Structure
| Graph Topology | Edge Types | ADE (K=5) | FDE (K=5) |
|---|---|---|---|
| Knowledge Graph | 46 | 1.15 | 2.20 |
| Fully Connected Graph | 1 | 1.19 | 2.31 |
| Fully Unconnected Graph | 0 | 1.24 | 2.46 |

IV-E2 Effect of Individual Components

Our proposed heterogeneous graph is mainly composed of four parts: map topology, meta-paths, agent-map relationships, and agent-agent relationships. We investigate the impact of dropping certain inputs to the model as shown in Table V.

TABLE V: Ablation Study for Graph Components
| Meta-Paths | Map-Topology | Agent-Map | Agent-Agent | ADE (K=5) | FDE (K=5) |
|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | ✓ | 1.15 | 2.20 |
| × | ✓ | ✓ | ✓ | 1.18 | 2.29 |
| ✓ | × | ✓ | ✓ | 1.17 | 2.26 |
| × | × | ✓ | ✓ | 1.22 | 2.39 |
| × | × | × | ✓ | 1.23 | 2.42 |
| × | × | × | × | 1.24 | 2.46 |

IV-E3 Integration into Other Models

We integrate our proposed Knowledge Graph into other graph-based models like VectorNet and LaFormer. Table VI shows the experimental results indicating that the Knowledge Graph can effectively improve the performance of the chosen methods.

TABLE VI: Ablation Study for Integrating other Architectures
| Architectures | ADE_5 | FDE_1 | OffRoadRate |
|---|---|---|---|
| VectorNet [5] | 1.34 | 7.98 | 0.04 |
| VectorNet + KG | 1.26 | 7.55 | 0.03 |
| LaFormer [8] | 1.19 | 6.95 | 0.02 |
| LaFormer + KG | 1.15 | 6.29 | 0.02 |

IV-E4 Effect of Heterogeneous Graph Operators

We analyze different heterogeneous graph operators such as HGT [67] and HAN [63]; the results are shown in Table VII. To prevent overfitting, we merge sub-classes such as single solid and double solid into switchViaPermitted and switchViaNonPermitted relationships. The suffix *N indicates that the operator has N layers.

TABLE VII: Ablation Study for HGNN Operators
| Interaction Graph | Operators | Self Loop | Meta Path | ADE (K=5) | FDE (K=5) |
|---|---|---|---|---|---|
| Original | HGT*2 | ✓ | × | 1.24 | 2.49 |
| Compact | HGT*2 | ✓ | × | 1.24 | 2.46 |
| Compact | HGT*2 | × | × | 1.22 | 2.38 |
| Compact | HAN*2 | × | ✓ | 1.19 | 2.34 |
| Compact | HAN*1 | × | ✓ | 1.15 | 2.20 |

IV-F Qualitative results

A qualitative visualization of our predictions is depicted in Figure 7. Green trajectories are the ground truth and red trajectories are the five predictions. Row 1 shows predictions that cover all driving path possibilities, and row 2 shows a lane-changing situation that is captured successfully.

Figure 7: Illustration of the qualitative results. Column 1 shows the traffic scene and column 2 shows the results of SemanticFormerR.

V CONCLUSIONS

This paper proposes a novel approach that uses a traffic scene knowledge graph, built from past trajectories and an HD map, as input for predicting a set of multimodal trajectories. A scene graph encoder module captures the interactions in a traffic scene from four aspects: agent-agent interaction, agent-map interaction, map-map interaction, and meta-path interaction. Further, the refinement module considers typical speed profiles and anchor paths to refine trajectory candidates. Our approach achieves competitive results compared to state-of-the-art methods on the nuScenes benchmark. We also provide an experimental justification of our approach by performing experiments with two SOTA methods, i.e. VectorNet and LaFormer, and replacing their original homogeneous graphs with our Knowledge Graph. We show that the Knowledge Graph improves the performance of those methods by 5% and 4%, respectively. Moreover, extensive ablation and sensitivity studies indicate that our proposed Knowledge Graph can be easily integrated into other graph-based methods to improve performance. Future work will focus on extending the Knowledge Graph with additional information such as traffic rules, traffic signs, and forms of driving common sense knowledge.

References

  • [1] C. Ju, Z. Wang, C. Long, X. Zhang, and D. E. Chang, “Interaction-aware kalman neural networks for trajectory prediction,” in 2020 IEEE Intelligent Vehicles Symposium (IV), 2020, pp. 1793–1800.
  • [2] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, et al., “Multimodal trajectory predictions for autonomous driving using deep convolutional networks,” ICRA, pp. 2090–2096, 2019.
  • [3] T. Phan-Minh, E. C. Grigore, F. A. Boulton, et al., “CoverNet: Multimodal behavior prediction using trajectory sets,” IEEE/CVF CVPR, 2019.
  • [4] J. Li, F. Yang, M. Tomizuka, et al., “EvolveGraph: Multi-agent trajectory prediction with dynamic relational reasoning,” NeurIPS, 2020.
  • [5] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid, “VectorNet: Encoding hd maps and agent dynamics from vectorized representation,” 2020 IEEE/CVF CVPR, pp. 11 522–11 530, 2020.
  • [6] L. Mlodzian, Z. Sun, H. Berkemeyer, S. Monka, Z. Wang, S. Dietze, L. Halilaj, and J. Luettin, “nuScenes knowledge graph - A comprehensive semantic representation of traffic scenes for trajectory prediction,” in IEEE/CVF ICCV 2023 - Workshops, 2023, pp. 42–52.
  • [7] H. Caesar, V. Bankiti, A. H. Lang, et al., “nuScenes: A multimodal dataset for autonomous driving,” in IEEE/CVF CVPR, 2020.
  • [8] M. Liu, H. Cheng, L. Chen, H. Broszio, J. Li, R. Zhao, M. Sester, and M. Y. Yang, “LAformer: Trajectory prediction for autonomous driving with lane-aware scene constraints,” in IEEE/CVF CVPR, 2024.
  • [9] N. Djuric, V. Radosavljevic, H. Cui, T. Nguyen, F.-C. Chou, T.-H. Lin, et al., “Uncertainty-aware short-term motion prediction of traffic actors for autonomous driving,” 2020 IEEE WACV, pp. 2084–2093, 2018.
  • [10] J. Hong, B. Sapp, and J. Philbin, “Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions,” 2019 IEEE/CVF CVPR, pp. 8446–8454, 2019.
  • [11] T. Gilles, S. Sabatini, D. Tsishkou, B. Stanciulescu, and F. Moutarde, “HOME: Heatmap output for future motion estimation,” in ITSC, 2021.
  • [12] ——, “THOMAS: trajectory heatmap output with learned multi-agent sampling,” ICLR, 2022.
  • [13] M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, et al., “Learning lane graph representations for motion forecasting,” ECCV, 2020.
  • [14] S. Casas, C. Gulino, R. Liao, and R. Urtasun, “SpAGNN: Spatially-aware graph neural networks for relational behavior forecasting from sensor data,” ICRA, pp. 9491–9497, 2019.
  • [15] B. Varadarajan, A. S. Hefny, A. Srivastava, K. S. Refaat, et al., “Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction,” ICRA, pp. 7814–7821, 2021.
  • [16] S. Konev, “MPA: Multipath++ based architecture for motion prediction,” IEEE/CVF CVPR Workshop on Autonomous Driving, 2022.
  • [17] H. Zhao, J. Gao, T. Lan, et al., “TNT: Target-driven trajectory prediction,” in Conference on Robot Learning, 2020.
  • [18] X. Mo, Z. Huang, Y. Xing, and C. Lv, “Multi-agent trajectory prediction with heterogeneous edge-enhanced graph attention network,” IEEE Transactions on ITS, vol. 23, pp. 9554–9567, 2022.
  • [19] X. Jia, P. Wu, L. Chen, Y. Liu, H. Li, and J. Yan, “HDGT: heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding,” IEEE Trans. PAMI, vol. 45, no. 11, pp. 13 860–13 875, 2023.
  • [20] T. Monninger, J. Schmidt, J. Rupprecht, D. Raba, et al., “SCENE: Reasoning about traffic scenes using heterogeneous graph neural networks,” IEEE Robotics and Automation Letters, vol. 8, no. 3, 2023.
  • [21] S. Wonsak, M. Al-Rifai, M. Nolting, and W. Nejdl, “Multi-modal motion prediction with graphormers,” in ITSC.   IEEE, 2022.
  • [22] D. Grimm, M. Zipfl, F. Hertlein, A. Naumann, J. Luettin, S. Thoma, S. Schmid, L. Halilaj, A. Rettinger, and J. M. Zöllner, “Heterogeneous graph-based trajectory prediction using local map context and social interactions,” IEEE ITSC, pp. 2901–2907, 2023.
  • [23] Z. Wang, Z. Sun, J. Luettin, and L. Halilaj, “SocialFormer: Social interaction modeling with edge-enhanced heterogeneous graph transformers for trajectory prediction,” 2024. [Online]. Available: https://arxiv.longhoe.net/abs/2405.03809
  • [24] L. Halilaj, J. Luettin, C. A. Henson, and S. Monka, “Knowledge graphs for automated driving,” IEEE AIKE, pp. 98–105, 2022.
  • [25] L. Halilaj, J. Luettin, S. Monka, C. A. Henson, and S. Schmid, “Knowledge graph-based integration of autonomous driving datasets,” Int. J. Semantic Comput., vol. 17, pp. 249–271, 2023.
  • [26] J. Luettin, S. Monka, C. A. Henson, and L. Halilaj, “A survey on knowledge graph-based methods for automated driving,” in Knowledge Graphs and Semantic Web, KGSWC.   Springer, 2022, pp. 16–31.
  • [27] S. Xiong, Y. Yang, F. Fekri, and J. C. Kerce, “TILP: Differentiable learning of temporal logical rules on knowledge graphs,” arXiv preprint arXiv:2402.12309, 2024.
  • [28] S. Xiong, A. Payani, R. Kompella, and F. Fekri, “Large language models can learn temporal reasoning,” arXiv preprint arXiv:2401.06853, 2024.
  • [29] L. Halilaj, J. Luettin, S. Rothermel, S. K. Arumugam, and I. Dindorkar, “Towards a knowledge graph-based approach for context-aware points-of-interest recommendations,” ACM SAC, 2021.
  • [30] S. Werner, A. Rettinger, L. Halilaj, and J. Luettin, “RETRA: Recurrent transformers for learning temporally contextualized knowledge graph embeddings,” in Extended Semantic Web Conference, 2020.
  • [31] L. Halilaj, I. Dindorkar, J. Luettin, and S. Rothermel, “A knowledge graph-based approach for situation comprehension in driving scenarios,” in Extended Semantic Web Conference, 2021.
  • [32] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, “MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,” in Conference on Robot Learning, 2019.
  • [33] S. Casas, W. Luo, and R. Urtasun, “IntentNet: Learning to predict intention from raw sensor data,” in Conference on Robot Learning, 2018.
  • [34] Y. Tang and R. Salakhutdinov, “Multiple futures prediction,” in Neural Information Processing Systems, 2019.
  • [35] K. Messaoud, I. Yahiaoui, A. Verroust-Blondet, and F. Nashashibi, “Attention based vehicle trajectory prediction,” IEEE Trans. Intell. Veh., vol. 6, no. 1, pp. 175–185, 2021.
  • [36] S. Park, G. Lee, M. Bhat, et al., “Diverse and admissible trajectory forecasting through multimodal context understanding,” in ECCV, 2020.
  • [37] Y. Yuan, X. Weng, Y. Ou, and K. Kitani, “AgentFormer: Agent-aware transformers for socio-temporal multi-agent forecasting,” IEEE/CVF ICCV, pp. 9793–9803, 2021.
  • [38] R. Girgis, F. Golemo, F. Codevilla, et al., “Latent variable sequential set transformers for joint multi-agent motion prediction.”   ICLR, 2021.
  • [39] S. Khandelwal, W. Qi, J. Singh, et al., “What-if motion prediction for autonomous driving,” in IEEE/RJS IROS, 2022.
  • [40] Y. Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou, “Multimodal motion prediction with stacked transformers,” in 2021 IEEE/CVF CVPR, 2021.
  • [41] Z. Huang, X. Mo, and C. Lv, “Multi-modal motion prediction with transformer-based neural network for autonomous driving,” ICRA, 2021.
  • [42] J. Ngiam, V. Vasudevan, B. Caine, Z. Zhang, et al., “Scene Transformer: A unified architecture for predicting future trajectories of multiple agents,” in ICLR, 2022.
  • [43] N. Nayakanti, R. Al-Rfou, A. Zhou, et al., “Wayformer: Motion forecasting via simple & efficient attention networks,” ICRA, 2022.
  • [44] S. Shi, L. Jiang, D. Dai, and B. Schiele, “Motion transformer with global intention localization and local movement refinement,” in NeurIPS, 2022.
  • [45] J. P. Mercat, T. Gilles, N. E. Zoghby, et al., “Multi-head attention for multi-modal joint vehicle motion forecasting,” ICRA, 2019.
  • [46] Z. Zhou, L. Ye, J. Wang, K. Wu, and K. Lu, “HiVT: Hierarchical vector transformer for multi-agent motion prediction,” CVPR, 2022.
  • [47] N. Rhinehart, R. T. McAllister, K. Kitani, and S. Levine, “PRECOG: Prediction conditioned on goals in visual multi-agent settings,” IEEE/CVF ICCV, pp. 2821–2830, 2019.
  • [48] E. Amirloo, A. Rasouli, P. Lakner, M. Rohani, and J. Luo, “LatentFormer: Multi-agent transformer-based interaction modeling and trajectory prediction,” ArXiv, vol. abs/2203.01880, 2022.
  • [49] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone, “Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data,” in ECCV, 2020.
  • [50] A. Seff, B. Cera, D. Chen, et al., “MotionLM: Multi-agent motion forecasting as language modeling,” ICCV, 2023.
  • [51] A. Keysan, A. Look, E. Kosman, G. Gürsun, J. Wagner, Y. Yao, and B. Rakitsch, “Can you text what is happening? integrating pre-trained language encoders into trajectory prediction models for autonomous driving,” ArXiv, vol. abs/2309.05282, 2023.
  • [52] Z. Huang, H. Liu, and C. Lv, “GameFormer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving,” IEEE/CVF ICCV, pp. 3880–3890, 2023.
  • [53] S. Casas, C. Gulino, S. Suo, K. Luo, et al., “Implicit latent variable model for scene-consistent motion forecasting,” in ECCV, 2020.
  • [54] S. V. Albrecht, C. Brewitt, J. Wilhelm, et al., “Interpretable goal-based prediction and planning for autonomous driving,” ICRA, 2020.
  • [55] J. Gu, C. Sun, and H. Zhao, “DenseTNT: End-to-end trajectory prediction from dense goal sets,” IEEE/CVF ICCV, 2021.
  • [56] N. Deo and M. M. Trivedi, “Trajectory forecasts in unknown environments conditioned on grid-based plans,” ArXiv, vol. abs/2001.00735, 2020.
  • [57] Q. Lu, W. Han, J. Ling, et al., “KEMP: Keyframe-based hierarchical end-to-end deep model for long- term trajectory prediction,” ICRA, 2022.
  • [58] D.-H. Park, H. Ryu, Y. Yang, J. Cho, et al., “Leveraging future relationship reasoning for vehicle trajectory prediction,” ICLR, 2023.
  • [59] A. Naumann, F. Hertlein, D. Grimm, M. Zipfl, S. Thoma, A. Rettinger, L. Halilaj, J. Luettin, S. Schmid, and H. Caesar, “Lanelet2 for nuscenes: Enabling spatial semantic relationships and diverse map-based anchor paths,” in IEEE/CVF CVPR, 2023, pp. 3247–3256.
  • [60] Z. Zhou, J. Wang, Y.-H. Li, and Y.-K. Huang, “Query-centric trajectory prediction,” 2023 IEEE/CVF CVPR, pp. 17 863–17 873, 2023.
  • [61] J. Chen, Z. Wang, J. Wang, and B. Cai, “Q-EANet: Implicit social modeling for trajectory prediction via experience-anchored queries,” IET Intelligent Transport Systems, 2023.
  • [62] M. Zipfl, F. Hertlein, A. Rettinger, S. Thoma, L. Halilaj, J. Luettin, S. Schmid, and C. A. Henson, “Relation-based motion prediction using traffic scene graphs,” IEEE ITSC, pp. 825–831, 2022.
  • [63] X. Wang, H. Ji, C. Shi, B. Wang, et al., “Heterogeneous graph attention network,” in The world wide web conference, 2019.
  • [64] B. Kim, S. H. Park, S. Lee, et al., “LaPred: Lane-aware prediction of multi-modal future trajectories of dynamic agents,” in IEEE/CVF CVPR, 2021.
  • [65] T. Gilles, S. Sabatini, D. V. Tsishkou, et al., “GOHOME: Graph-oriented heatmap output for future motion estimation,” ICRA, 2021.
  • [66] N. Deo, E. M. Wolff, and O. Beijbom, “Multimodal trajectory prediction conditioned on lane-graph traversals,” in CoRL, 2021.
  • [67] Z. Hu, Y. Dong, K. Wang, and Y. Sun, “Heterogeneous graph transformer,” in Proceedings of the web conference, 2020, pp. 2704–2710.