SemanticFormer: Holistic and Semantic Traffic Scene Representation for Trajectory Prediction using Knowledge Graphs

Zhigang Sun¹, Zixu Wang²,³, Lavdim Halilaj³, Juergen Luettin³

¹ Zhigang Sun is with Bosch Center for Artificial Intelligence (corresponding author: Zhigang Sun).
² Zixu Wang is with the Technical University of Munich (TUM), Germany.
³ Zixu Wang, Lavdim Halilaj, and Juergen Luettin are with Robert Bosch GmbH, {firstname.lastname}@bosch.com.

Abstract

Trajectory prediction in autonomous driving relies on an accurate representation of all relevant contexts of the driving scene, including traffic participants, road topology, and traffic signs, as well as their semantic relations to each other. Despite increased attention to this issue, most approaches in trajectory prediction do not consider all of these factors sufficiently. We present SemanticFormer, an approach for predicting multimodal trajectories by reasoning over a semantic traffic scene graph using a hybrid approach. It utilizes high-level information in the form of meta-paths, i.e. trajectories on which an agent is allowed to drive, which are extracted from a knowledge graph and then processed by a novel pipeline based on multiple attention mechanisms to predict accurate trajectories. SemanticFormer comprises a hierarchical heterogeneous graph encoder to capture spatio-temporal and relational information across agents as well as between agents and road elements. Further, it includes a predictor to fuse different encodings and decode trajectories with probabilities. Finally, a refinement module assesses permitted meta-paths of trajectories and speed profiles to obtain final predicted trajectories. Evaluation on the nuScenes benchmark demonstrates improved performance compared to several SOTA methods. In addition, we demonstrate that our knowledge graph can be easily added to two existing graph-based SOTA methods, namely VectorNet and LaFormer, replacing their original homogeneous graphs. The evaluation results suggest that adding our knowledge graph enhances the performance of the original methods by 5% and 4%, respectively. Graph data is available at https://github.com/boschresearch/nuScenes_Knowledge_Graph

I INTRODUCTION

Autonomous vehicles are recognized as a promising solution to address critical challenges such as road safety, traffic congestion, and energy optimization. A crucial task towards realizing the vision of autonomous driving is motion prediction [1]. It involves determining a set of spatial coordinates that represent the predicted movement of a given agent within a future time window. However, motion prediction is a challenging task due to various contextual factors such as the difficulty of intention prediction, the complex interactions of traffic participants, the intricate road topology comprising lanes, lane dividers, and pedestrian crossings, as well as adherence to traffic regulations. State-of-the-art approaches utilize various representations for traffic scenes, such as raster-based [2, 3] or graph-based [4, 5] representations, to capture and utilize contextual information sufficiently.

Figure 1: Driving scenes represented in a heterogeneous graph capturing all relevant map details, traffic agents, and their semantic relationships.

Recent work applies a knowledge graph (KG) to encode diverse contextual information from traffic scenes [6]. Figure 1 illustrates various types of elements comprised in a typical traffic scene including different entities and their relations along with their semantic descriptions. We propose a novel approach that leverages heterogeneous information of static and dynamic elements modeled in the KG. It contains an attention mechanism for consuming semantic relationships and dependencies between traffic agents and road elements for accurate multimodal trajectory prediction. Main contributions:

• A knowledge graph based approach to encode all relevant static and dynamic elements of a traffic scene with their semantic relationships.

• A hybrid architecture with attention mechanisms to model the semantic relationships and dependencies between traffic agents and road elements for accurate multimodal trajectory prediction, evaluated on the nuScenes dataset [7].

• Dedicated experiments to demonstrate the ease of incorporating our KG into existing graph-based trajectory prediction models. Concretely, we integrate the KG into VectorNet [5] and LaFormer [8] (changing the GIG block). The evaluation results show that incorporating the KG into VectorNet and LaFormer improves their ADE performance by 5% and 4%, respectively.

II RELATED WORK

Representation. Early methods for trajectory prediction use raster-based bird's-eye-view representations of the map and agents, encoding them with a number of channels for different information sources [9, 10]. These methods are extended to predict multiple trajectories with associated probabilities [2, 3]. Others aim to estimate probability distribution heat maps representing locations where agents could be located at a fixed time horizon [11, 12]. However, these models usually do not have access to high-level information and need to learn complex relationships from raw pixels.

Graph-based approaches represent scenes as vectors, polylines and graphs and thus operate at a higher level of abstraction [5, 13, 14, 15, 16, 8]. VectorNet [5] encodes both map features and agent trajectories as polylines and then merges them with a global interaction graph. TNT [17] extends VectorNet and combines it with multiple target reference trajectory proposals sampled from the lanes to diversify the prediction points. Unfortunately, these techniques usually use homogeneous graphs with one entity type and one relation type which prevents them from representing the rich heterogeneous traffic scene along with their complex relations.

Methods that use heterogeneous graphs, i.e. graphs with different entity types such as vehicles, bicycles or pedestrians and relation types like agent-to-lane or agent-to-agent, have recently been proposed [18, 19, 20, 21, 22, 23]. However, they cover only a portion of the relevant information and are unable to fully capture all scene details and the interconnections between the entities. Our approach aims to fill this gap using formal ontologies for constructing a knowledge graph [24, 25, 26, 27, 28] capturing the rich information of traffic scenes. Knowledge graphs have been applied in other automotive applications like POI recommendation [29, 30] and driving situation understanding [31].

Encoding. Early encodings are based on CNNs [32, 33, 2, 10], while more recent works use GNNs [14, 5, 13, 16, 15]. Attention mechanisms have recently attracted high interest in modeling the interactive behavior between agents for raster-based approaches [34, 35, 36, 37, 38], graph-based approaches [39, 40, 41, 42, 43, 44] and map-free approaches [45]. A hierarchical vector transformer-based approach, HiVT, is presented in [46]; it consists of a local context feature encoding followed by global message passing among agent-centric local regions. Autoregressive trajectory prediction approaches that generate trajectories at intervals to produce scene-consistent multi-agent trajectories are proposed in [47, 34, 48, 49, 37]. Based on language modeling concepts with transformers, MotionLM [50] treats continuous trajectories as sequences of discrete motion tokens and casts multi-agent motion prediction as a language modeling task. In [51], a pretrained language model is used to encode text describing traffic situations combined with raster-based encodings. A game-theoretic modeling and learning approach considering relations between scene elements, alongside a novel hierarchical transformer decoder architecture, is presented in [52]. We also use a transformer-based architecture but encode different information sources including map topology, meta-paths, as well as relational information.

Predicting. Goal- or intention-conditioned systems sample goal candidates and predict trajectories conditioned on them [53, 54, 44, 17, 55]. Grid-based policy learning via maximum entropy inverse reinforcement learning is used in [56] to condition trajectory forecasts. The authors of [57] use key-frames as representative states to trace out the general direction of the trajectory. Approaches considering lane-aware scene constraints that align motion dynamics with scene information are shown in [8, 58]. Our architecture is related, but we use a heterogeneous graph transformer to process the heterogeneous information of the KG. Others use anchors, i.e. fixed sets of anchor trajectories corresponding to permitted trajectories, to guide trajectory prediction [32, 3, 22, 59]. A method to learn latent representations of anchor trajectories is presented in [15]. Query-centric trajectory prediction is proposed in [60, 61], where agents' decisions are formulated as information queries using the available information before they make a decision. Our approach is related but refines anchors into meta-paths by using contextual information.

III METHODOLOGY

We aim to represent all relevant information that governs the behavior of traffic participants. For example, information about lane dividers (e.g. dashed line, solid line) conveys information about permitted lane changes and is therefore important for trajectory prediction; a pedestrian crossing together with the distance and direction of nearby pedestrians governs the behavior of oncoming vehicles. As seen below, it is important to represent not only all relevant information but also the relations between these pieces of information. We address this challenge by representing the map and agents with a knowledge graph. This enables us to explicitly model the various map elements and agents as well as their semantic relations. It also allows for the modeling of diverse traffic agent types, like cars and bicycles, and their relations in driving situations, such as whether two agents might interact, or drive behind or next to one another.

In the following, we describe a comprehensive architecture depicted in Figure 2, which uses a knowledge graph for predicting multimodal trajectories. The architecture begins by taking the scene graph $g_{i}$ as input and outputs multimodal trajectories for the target agent. Finally, the refinement module filters the predicted trajectories, considering anchor paths and speed profiles to avoid failure cases.

III-A Ontology and Heterogeneous Scene Graph

We utilize ontologies to explicitly represent the abundance of information from traffic scenes. Thus, based on the domain knowledge we model relationships between entities considered important for the trajectory prediction task. Figure 3 illustrates the developed ontologies, encompassing various entity and relation types. The entity types are categorized into two groups: the first one contains static entities like lane types, boundaries, center lines and stop areas; the second group contains dynamic entities like agents, their states, and bounding boxes. As for relation types, they fall into three groups: 1) between agents, which construct semantic associations such as lateral, longitudinal, and intersecting, as shown in Figure 4(b) akin to the concepts presented in [62]; 2) between map elements, establishing lane connectivity and relationships between lanes and road infrastructure elements like stop areas, traffic lights, pedestrian crossings; and 3) relations between map elements and agents, utilizing probability projection to map agents onto road infrastructure. Based on the designed ontology, we represent the scene by a heterogeneous scene graph $G=(V,E,\tau,\phi)$ . It has nodes $v\in V$ , their types $\tau(v)$ , and edges $(u,v)\in E$ , with edge types $\phi(u,v)$ . The edges are directed since they are based on properties of the knowledge graph.

III-B Problem Formulation for Trajectory Prediction

We assume that the perception part can provide detailed information about agent positions and past motion as well as the HD map, so we build the scene graph as described in the previous section. Then, a sample of the dataset can be formed as $\left(g_{i},y_{i}\right)$, where $g_{i}$ is a sample scene graph with trajectory information, local map, and target identifier, and $y_{i}$ is the ground truth future trajectory of the given target. Both agent past trajectories and map information are represented hierarchically. Further, $g_{i}\in G$ covers the information within a chosen time horizon $\left\{-t_{h}+1,\cdots,0,1,\cdots,t_{f}\right\}$. We use $\mathbf{P}_{-t_{h}+1:0}^{i}=\left\{sp_{-t_{h}+2}^{i},sp_{-t_{h}+3}^{i},\ldots,sp_{0}^{i}\right\}$ to represent the respective scene participant nodes. Each participant node $sp_{t}^{i}$ is modeled as $sp_{t}^{i}=\left[d_{t,s}^{i},d_{t,e}^{i},a^{i}\right]$, where $d_{t,s}^{i}$ and $d_{t,e}^{i}$ stand for the participant locations at the previous and current time stamps, whereas $a^{i}$ represents additional attributes like velocity, acceleration, heading change rate and the object type. For map information we use $\mathbf{S}_{1:N}^{i}=\left\{s_{1}^{i},s_{2}^{i},\ldots,s_{N}^{i}\right\}$ to denote a lane snippet, $s_{n}^{i}$ for lane slices and $N$ for the length of the given lane snippet. Each lane slice vector $s_{n}^{i}=\left[d_{n,s}^{i},d_{n,e}^{i},a_{i},d_{n,\mathrm{pre}}^{i}\right]$ adds $d_{n,\mathrm{pre}}^{i}$ to indicate the predecessor of the starting point. Connections between lane snippets are built by lane connectors $\mathbf{C}_{1:N}^{i}=\left\{c_{1}^{i},c_{2}^{i},\ldots,c_{N}^{i}\right\}$, where each $c_{n}^{i}$ encodes an ordered pose inside the lane connector of length $N$.

Coordinates in the knowledge graph are initially in a global coordinate system. These are then transformed separately into local, scene graph-specific coordinates, with the origin at the location of the target agent and the positive y-axis pointing along the facing direction of the target.
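The global-to-local transformation described above can be sketched as a translation followed by a rotation. This is only an illustration under stated assumptions: the function name `to_local`, the use of 2D points, and the convention that the heading is given in radians counter-clockwise from the global x-axis are all hypothetical and not taken from the original implementation.

```python
import math

def to_local(points, target_xy, target_heading):
    """Map global (x, y) points into a frame centred on the target agent
    whose +y axis points along the target's heading (radians, measured
    counter-clockwise from the global +x axis; assumed convention)."""
    # Rotating by (pi/2 - heading) turns the heading direction into +y.
    angle = math.pi / 2.0 - target_heading
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    tx, ty = target_xy
    local = []
    for x, y in points:
        dx, dy = x - tx, y - ty  # translate: the target becomes the origin
        local.append((cos_a * dx - sin_a * dy, sin_a * dx + cos_a * dy))
    return local

# Target at (2, 2) facing the global +x direction.
# (3, 2) lies one metre ahead, (2, 3) one metre to the left.
pts = to_local([(3.0, 2.0), (2.0, 3.0)], (2.0, 2.0), 0.0)
```

In this sketch a point straight ahead of the target ends up on the positive local y-axis, and a point to its left ends up on the negative local x-axis.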

III-C Semantic Scene Graph Hierarchical Modeling

III-C1 Meta-Path Generation

We extract meta-paths to describe permitted and possible driving directions to navigate the target participant. Meta-paths related to the permitted lane changes and turns can be divided into three groups: 1) lane-changing; 2) entering the lane connector; and 3) leaving the lane connector.

Figure 4(a) gives a qualitative analysis of generated meta-paths. Specifically, we illustrate sample meta-paths below for the lane-changing (Equation 1), leaving-connector (Equation 2), and entering-connector (Equation 3) cases, where $\Phi$ represents the meta-path.

\Phi_{0}=\mathbf{P}\xrightarrow{isOn}\mathbf{S}\xrightarrow{switchViaX}\mathbf{S}\xrightarrow{switchViaX}\mathbf{S}

(1)

\Phi_{1}=\mathbf{P}\xrightarrow{isOn}\mathbf{C}\xrightarrow{CconnectS}\mathbf{S}\xrightarrow{switchViaX}\mathbf{S}

(2)

\Phi_{2}=\mathbf{P}\xrightarrow{isOn}\mathbf{S}\xrightarrow{switchViaX}\mathbf{S}\xrightarrow{SconnectC}\mathbf{C}

(3)
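To make the meta-path notation concrete, the following minimal sketch enumerates node sequences that follow a typed relation pattern on a toy graph. The triple-based edge list, the function `match_meta_path`, and the toy node identifiers are hypothetical; treating `switchViaX` as "any relation whose name starts with `switchVia`" is also an assumption made only for illustration.

```python
def match_meta_path(edges, node_type, start, pattern):
    """Return every node sequence starting at `start` that follows `pattern`.

    edges:     list of (source, relation, target) triples
    node_type: dict mapping node id -> type letter ("P", "S" or "C")
    pattern:   list of (relation_prefix, target_type) steps
    """
    paths = [[start]]
    for relation_prefix, target_type in pattern:
        extended = []
        for path in paths:
            for src, rel, dst in edges:
                if (src == path[-1] and rel.startswith(relation_prefix)
                        and node_type[dst] == target_type
                        and dst not in path):  # do not revisit a node
                    extended.append(path + [dst])
        paths = extended
    return paths

# Toy scene: one participant on snippet s1, two further snippets, one connector.
node_type = {"p": "P", "s1": "S", "s2": "S", "s3": "S", "c1": "C"}
edges = [
    ("p", "isOn", "s1"),
    ("s1", "switchViaDoubleDashedWhite", "s2"),
    ("s2", "switchViaDoubleDashedWhite", "s3"),
    ("s2", "SconnectC", "c1"),
]

PHI_0 = [("isOn", "S"), ("switchVia", "S"), ("switchVia", "S")]  # lane-changing
PHI_2 = [("isOn", "S"), ("switchVia", "S"), ("SconnectC", "C")]  # entering a connector

lane_change_paths = match_meta_path(edges, node_type, "p", PHI_0)
enter_connector_paths = match_meta_path(edges, node_type, "p", PHI_2)
```

The pattern is matched one step at a time, so the cost grows with the number of partial paths; for the short, three-step patterns of Equations 1 to 3 this stays small.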

III-C2 Agent Motion and Lane Encoder

This component is responsible for encoding spatio-temporal information. We process participants $\mathbf{P}^{i}$ , lane snippets $\mathbf{S}_{1:N}^{i}$ , and lane connectors $\mathbf{C}_{1:N}^{i}$ in a sequential manner using both a Graph Neural Network (GNN) and a Gated Recurrent Unit (GRU) layer. Their respective encodings are represented by $p_{i}$ , $s_{j}$ , and $c_{z}$ . Further, inspired by LaneGCN [13], we merge the outcomes as shown in Figure 5. Equation 4 introduces lane information to the related agents while equation 5 and equation 6 add participant information to the related lanes and lane connectors.

p_{i}=p_{i}+\operatorname{CrossAtt}\left\{p_{i},[s_{j},c_{z}]\right\}

(4)

s_{j}=s_{j}+\operatorname{CrossAtt}\left\{s_{j},p_{i}\right\}

(5)

c_{z}=c_{z}+\operatorname{CrossAtt}\left\{c_{z},p_{i}\right\}

(6)

where $i\in\left\{1,\ldots,N_{\text{P}}\right\},j\in\left\{1,\ldots,N_{\text{LS}}\right\},z\in\left\{1,\ldots,N_{\text{LC}}\right\}$. Encodings are assigned to node attributes in scene graph $g_{i}$.
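The residual cross-attention pattern shared by Equations 4 to 6 can be illustrated with a small, dependency-free sketch. It is a simplification under assumptions: a single attention head, no learned query/key/value projections, and every update reading the encodings as they were before any update. The helper names `cross_att` and `residual_update` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_att(query, context):
    """Single-head scaled dot-product attention of one query vector over a
    list of context vectors. Learned projections are omitted (assumption)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, ctx)) / scale for ctx in context]
    weights = softmax(scores)
    return [sum(w * ctx[d] for w, ctx in zip(weights, context))
            for d in range(len(query))]

def residual_update(node, context):
    """x <- x + CrossAtt{x, context}, the pattern shared by Equations 4-6."""
    attended = cross_att(node, context)
    return [x + a for x, a in zip(node, attended)]

p_i = [1.0, 0.0]   # participant encoding
s_j = [0.0, 1.0]   # lane snippet encoding
c_z = [1.0, 1.0]   # lane connector encoding

p_new = residual_update(p_i, [s_j, c_z])  # Equation 4
s_new = residual_update(s_j, [p_i])       # Equation 5
c_new = residual_update(c_z, [p_i])       # Equation 6
```

With a single context vector the attention weight is 1, so Equations 5 and 6 reduce to adding the participant encoding to the lane encoding in this toy setting.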

III-C3 Scene Graph Encoder

A heterogeneous graph operator is used to reason over the given scene graph $g_{i}$. To better incorporate the generated meta-paths, we follow the principle of HAN [63], i.e. we use a hierarchical attention structure from node-level attention to semantic-level attention, as shown in Figure 6.

Applying HAN to learn relational information is shown in Algorithm 1. Three distinct node types are used for the probability predictor to encode participants, lane snippets, and lane connectors. We use $p_{i}$ , $s_{j}$ , $c_{z}$ to represent these three types respectively, where $p_{i}\in Z$ , $s_{j}\in Z$ , $c_{z}\in Z$ .

input : Heterogeneous scene graph $G=(V,E,\tau,\phi)$
        Node features $\left\{h_{i},\forall i\in V,h\in\{p,s,c\}\right\}$
        Meta-path set $\left\{\Phi_{0},\Phi_{1},\ldots,\Phi_{P}\right\}$
        Number of attention heads $K$

output : Heterogeneous graph node embedding $Z$

for $\Phi_{i}\in\left\{\Phi_{0},\Phi_{1},\ldots,\Phi_{P}\right\}$ do
    for $k=1\ldots K$ do
        Type-specific transformation $\mathrm{h}_{i}^{\prime}\leftarrow\text{MLP}\{\mathrm{h}_{i}\}$
        for $i\in V$ do
            Find the meta-path based neighbors $N_{i}^{\Phi}$
            for $j\in N_{i}^{\Phi}$ do
                Calculate the weight coefficient $\alpha_{ij}^{\Phi}$
            end for
            Calculate the semantic-specific node embedding $\mathrm{z}_{i}^{\Phi}\leftarrow\sigma\left(\sum_{j\in N_{i}^{\Phi}}\alpha_{ij}^{\Phi}\cdot\mathbf{h}_{j}^{\prime}\right)$
        end for
        Concatenate the learned embeddings from all attention heads $\mathrm{z}_{i}^{\Phi}\leftarrow\|_{k=1}^{K}\sigma\left(\sum_{j\in N_{i}^{\Phi}}\alpha_{ij}^{\Phi}\cdot\mathbf{h}_{j}^{\prime}\right)$
    end for
    Calculate the weight of meta-path $\beta_{\Phi_{i}}$
    Fuse the semantic-specific embedding $Z\leftarrow\sum_{i=1}^{P}\beta_{\Phi_{i}}\cdot Z_{\Phi_{i}}$
end for
return $Z$

Algorithm 1 Semantic Graph Learning via HAN
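The two attention levels of Algorithm 1 can be sketched in plain Python as follows. This is a heavily simplified illustration, not the trained model: it uses a single head, skips the type-specific MLP, scores neighbors by a dot product instead of a learned attention vector, picks tanh for $\sigma$, and replaces the learned semantic attention vector with a fixed hypothetical vector `q`. All function and variable names are assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def node_level(features, neighbours):
    """Semantic-specific embeddings z_i^Phi for one meta-path.

    features:   dict node -> feature vector h'_i (assumed already projected)
    neighbours: dict node -> list of meta-path based neighbours N_i^Phi
    alpha_ij is a softmax over dot products, standing in for the learned
    attention of HAN (assumption).
    """
    out = {}
    for i, h_i in features.items():
        nbrs = neighbours.get(i) or [i]  # fall back to a self-loop
        scores = [sum(a * b for a, b in zip(h_i, features[j])) for j in nbrs]
        alpha = softmax(scores)
        agg = [sum(w * features[j][d] for w, j in zip(alpha, nbrs))
               for d in range(len(h_i))]
        out[i] = [math.tanh(v) for v in agg]  # sigma chosen as tanh here
    return out

def semantic_level(per_path, q):
    """Fuse per-meta-path embeddings with softmax weights beta_Phi.

    Each meta-path is scored by the mean of q . z_i over all nodes.
    """
    scores = []
    for z in per_path:
        per_node = [sum(a * b for a, b in zip(q, v)) for v in z.values()]
        scores.append(sum(per_node) / len(per_node))
    beta = softmax(scores)
    fused = {i: [sum(b * z[i][d] for b, z in zip(beta, per_path))
                 for d in range(len(q))] for i in per_path[0]}
    return fused, beta

features = {"p": [1.0, 0.0], "s1": [0.0, 1.0], "c1": [1.0, 1.0]}
neighbours_phi0 = {"p": ["s1"], "s1": ["p"], "c1": []}
neighbours_phi1 = {"p": ["c1"], "s1": ["c1"], "c1": ["p", "s1"]}

z_per_path = [node_level(features, n) for n in (neighbours_phi0, neighbours_phi1)]
Z, beta = semantic_level(z_per_path, q=[0.5, 0.5])
```

The sketch keeps the structure of the algorithm: one embedding per node and meta-path at the node level, then one convex combination over meta-paths at the semantic level.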

III-C4 Probability Predictor

As a result of the scene graph encoder, nodes of lane snippets $s_{i}$ and lane connectors $c_{i}$ are projected to the same dimension $Z$. We treat these two types of nodes as the same type and use $l_{i}$ to represent them. Inspired by LaFormer [8], we align the target agent motion and lane information at each future time step $t\in\{1,\ldots,t_{f}\}$. To achieve this, we use a lane score head and an attention mechanism to predict lane encoding probabilities. In the attention mechanism, the key ($K$) and value ($V$) vectors are produced by $MLP(p_{i})$, whereas the query ($Q$) is produced by $MLP(l_{i})$. Next, attention encodings are calculated by $A_{i,j}=\operatorname{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$. The predicted score of the $j\text{th}$ lane encoding at $t$ is shown in equation 7, where $\phi$ denotes MLP layers. We select the top-k lane encodings to maintain the uncertainty and concatenate the candidate lane segments and associated scores over the future time steps to obtain $L=\operatorname{ConCat}\left\{l_{1:k},\hat{s}_{1:k}\right\}_{t=1}^{t_{f}}$.

\hat{s}_{j,t}=\frac{\exp\left(\phi\left\{p_{i},l_{j},A_{i,j}\right\}\right)}{\sum_{n=1}^{N_{\text{lane}\in\Phi_{j}}}\exp\left(\phi\left\{p_{i},l_{n},A_{i,n}\right\}\right)}

(7)

To optimize the probability estimation, we use a binary cross-entropy loss $\mathcal{L}_{\text{lane }}$ , as shown in equation 8. Ground truth lane segment $s_{t}$ relies on the isOn relationship in the knowledge graph. Next, a cross-attention operation is performed to further fuse agent and lane information. Key and value vectors are $L$ , query vector is $p_{i}$ . The updated lane output is $l_{i,\mathrm{att}}$ .

\mathcal{L}_{\text{lane}}=\sum_{t=1}^{t_{\mathrm{f}}}\mathcal{L}_{\mathrm{CE}}\left(s_{t},\hat{s}_{t}\right)

(8)

Then we employ a predictor for generating multimodal trajectories. This is realized by sampling a latent vector $z$ from a multivariate normal distribution and adding it to the fusion encodings. Next, a Laplacian mixture density network (MDN) decoder is used to output a set of trajectories $\sum_{m=1}^{M}\hat{\pi}_{m}\operatorname{Laplace}(\mu,b)$ . $\hat{\pi}_{m}$ denotes the probability of each mode and $\sum_{m=1}^{M}\hat{\pi}_{m}=1$ . $\mu$ and $b$ represent the location and scale parameters of each Laplace component. We use an MLP to predict $\hat{\pi}_{m}$ , a GRU to recover the time dimension $t_{\mathrm{f}}$ of the predictions, and two MLPs to predict $\mu$ and $b$ . The predictor is trained by minimizing a regression loss and a classification loss. Regression loss is computed using the Winner-Takes-All strategy as shown in equation 9.

\mathcal{L}_{\mathrm{reg}}=\frac{1}{t_{f}}\sum_{t=1}^{t_{f}}-\log P\left(Y_{t}\mid\mu_{t}^{m^{*}},b_{t}^{m^{*}}\right)

(9)

where $Y$ is the ground truth position and $m^{*}$ represents the best mode which has minimum $L_{2}$ error among the $M$ predictions. Cross-entropy loss is used to optimize the mode classification as shown in equation 10.

\mathcal{L}_{\mathrm{cls}}=\sum_{m=1}^{M}-\pi_{m}\log\left(\hat{\pi}_{m}\right).

(10)
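Equations 9 and 10 can be illustrated with the short sketch below. It assumes an independent Laplace distribution per coordinate, selects the best mode $m^{*}$ by the summed per-step $L_{2}$ error, and uses a one-hot target $\pi$ on the winning mode; the source does not spell out these details, so they are labeled assumptions, as are all function names.

```python
import math

def laplace_nll(y, mu, b):
    """-log Laplace(y | mu, b) for one scalar: log(2b) + |y - mu| / b."""
    return math.log(2.0 * b) + abs(y - mu) / b

def best_mode(gt, mus):
    """Index m* of the mode with the smallest summed L2 error (assumption)."""
    def l2(traj):
        return sum(math.hypot(px - gx, py - gy)
                   for (px, py), (gx, gy) in zip(traj, gt))
    errors = [l2(traj) for traj in mus]
    return errors.index(min(errors))

def regression_loss(gt, mus, bs):
    """Equation 9: mean NLL of the winning mode only (Winner-Takes-All)."""
    m = best_mode(gt, mus)
    total = 0.0
    for (gx, gy), (mx, my), (bx, by) in zip(gt, mus[m], bs[m]):
        total += laplace_nll(gx, mx, bx) + laplace_nll(gy, my, by)
    return total / len(gt), m

def classification_loss(pi_hat, m_star):
    """Equation 10 with a one-hot target pi on the winning mode (assumption)."""
    return -math.log(pi_hat[m_star])

gt = [(0.0, 1.0), (0.0, 2.0)]                       # ground truth Y_t
mus = [[(0.0, 1.0), (0.0, 2.0)],                    # mode 0: exact
       [(1.0, 1.0), (2.0, 2.0)]]                    # mode 1: drifts sideways
bs = [[(1.0, 1.0), (1.0, 1.0)], [(1.0, 1.0), (1.0, 1.0)]]  # unit scales

l_reg, m_star = regression_loss(gt, mus, bs)
l_cls = classification_loss([0.8, 0.2], m_star)
```

Because only the winning mode receives a regression gradient, the remaining modes are free to cover other plausible futures, which is the usual motivation for the Winner-Takes-All strategy.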

Several metrics are used to evaluate the deviation from the ground truth, like velocity loss and angle loss, and investigate the influence of different measurements on the predictions. For the velocity loss, we calculate the ground truth velocity traces $V_{t}=\|Y_{t}-Y_{t-1}\|_{2}$ and prediction velocity traces $\hat{V_{t}}=\|\mu_{t}-\mu_{t-1}\|_{2}$ , then velocity loss is shown in equation 11.

\mathcal{L}_{\mathrm{velocity}}=\frac{1}{t_{f}}\sum_{t=1}^{t_{f}}-\log P\left(V_{t}\mid\hat{V_{t}}^{m^{*}},b_{t}^{m^{*}}\right)

(11)

For the angle loss, $X_{0}$ is used to denote the initial position and we calculate ground truth angle $\theta_{t}=\arctan 2\left(Y_{t}-X_{0}\right)$ and prediction angle $\hat{\theta}_{t}=\arctan 2\left(\mu_{t}-X_{0}\right)$ . The following equation 12 shows the calculation of the loss:

\mathcal{L}_{\text{angle}}=\frac{1}{t_{f}}\sum_{t=1}^{t_{f}}-\cos\left(\hat{\theta}_{t}-\theta_{t}\right)

(12)
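The velocity traces and the angle loss of Equation 12 can be sketched as below. The helper names are hypothetical, the trajectory passed to `velocity_trace` is assumed to include the starting position, and the two-argument `atan2(dy, dx)` form is assumed for $\arctan 2$.

```python
import math

def velocity_trace(traj):
    """V_t = ||Y_t - Y_{t-1}||_2 for consecutive positions."""
    return [math.hypot(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(traj, traj[1:])]

def angle_loss(gt, pred, origin):
    """Equation 12: mean of -cos(theta_hat_t - theta_t), with both angles
    measured from the initial position X_0 via atan2(dy, dx)."""
    ox, oy = origin
    total = 0.0
    for (gx, gy), (px, py) in zip(gt, pred):
        theta = math.atan2(gy - oy, gx - ox)
        theta_hat = math.atan2(py - oy, px - ox)
        total += -math.cos(theta_hat - theta)
    return total / len(gt)

speeds = velocity_trace([(0.0, 0.0), (0.0, 1.0), (0.0, 3.0)])

gt = [(0.0, 1.0), (0.0, 2.0)]        # straight ahead
perp = [(1.0, 0.0), (2.0, 0.0)]      # 90 degrees off
loss_same = angle_loss(gt, gt, (0.0, 0.0))
loss_perp = angle_loss(gt, perp, (0.0, 0.0))
```

The loss reaches its minimum of -1 when the predicted direction matches the ground truth and rises to about 0 for a perpendicular prediction.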

The total loss for the motion prediction is given by equation 13.

\mathcal{L}=\lambda_{1}\mathcal{L}_{\text{lane}}+\lambda_{2}\mathcal{L}_{\text{velocity}}+\lambda_{3}\mathcal{L}_{\text{angle}}+\mathcal{L}_{\text{reg}}+\mathcal{L}_{\mathrm{cls}}

(13)

III-D Prediction Refinement

To filter out unreasonable predictions, we compare the predicted trajectories against anchor paths [59]. Anchor paths provide possible and permitted trajectories for an agent at a given position in the road network, and trajectory candidates far from these anchor paths are filtered out. Next, we cluster the remaining trajectory candidates w.r.t. their speed profiles and keep the top candidates closest to the cluster centers. As a deliberately unfair comparison, we also perform experiments using the ground truth speed profile to get an idea about the relevance of the speed component in the prediction results. Details are shown in Algorithm 2.

input : Predictions $\left\{\mu_{1:t_{f}}^{1},\mu_{1:t_{f}}^{2},\ldots,\mu_{1:t_{f}}^{k}\right\}$
        Predicted probabilities $\{\pi_{1},\pi_{2},\ldots,\pi_{k}\}$
        Anchor paths $\{P_{1},P_{2},\ldots P_{5}\}$

output : Filtered predictions $\left\{\hat{Y}_{1:t_{f}}^{1},\hat{Y}_{1:t_{f}}^{2},\ldots,\hat{Y}_{1:t_{f}}^{5}\right\}$

if ground truth speed profile $s_{gt}$ available then
    Calculate the speed profiles $s_{1},s_{2},\ldots,s_{k}$
    Calculate similarity to $s_{gt}$ using Dynamic Time Warping (DTW)
    Select the 5 most similar predictions $\left\{\hat{Y}_{1:t_{f}}^{1},\hat{Y}_{1:t_{f}}^{2},\ldots,\hat{Y}_{1:t_{f}}^{5}\right\}$
else
    for $P_{i}\in\{P_{1},P_{2},\ldots P_{5}\}$ do
        for $\mu_{1:t_{f}}^{j}\in\left\{\mu_{1:t_{f}}^{1},\mu_{1:t_{f}}^{2},\ldots,\mu_{1:t_{f}}^{k}\right\}$ do
            Calculate the distance $d_{ij}$ between $P_{i}$ and $\mu_{1:t_{f}}^{j}$
        end for
        For each $i$, select the $\min_{5}d_{ij}$ and calculate the speed profiles $s_{i1},s_{i2},s_{i3},s_{i4},s_{i5}$
        Cluster speed profiles $s_{ij}$ using K-means and output the prediction $\hat{Y}_{1:t_{f}}^{i}$ closest to the cluster centers.
    end for
end if
return $\left\{\hat{Y}_{1:t_{f}}^{1},\hat{Y}_{1:t_{f}}^{2},\ldots,\hat{Y}_{1:t_{f}}^{5}\right\}\subseteq\left\{\mu_{1:t_{f}}^{1},\mu_{1:t_{f}}^{2},\ldots,\mu_{1:t_{f}}^{k}\right\}$

Algorithm 2 Prediction Refinement
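A compact sketch of both branches of Algorithm 2 is given below. Several choices are assumptions rather than details from the source: the distance between an anchor path and a trajectory is taken as the mean nearest-point distance, and the K-means step is replaced by a single cluster, so that the candidate closest to the mean speed profile is kept. The function names are hypothetical.

```python
import math

def speed_profile(traj):
    return [math.hypot(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(traj, traj[1:])]

def dtw(a, b):
    """Classic O(len(a) * len(b)) Dynamic Time Warping distance on scalars."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def path_distance(anchor, traj):
    """Mean distance from each predicted point to its nearest anchor point."""
    return sum(min(math.hypot(px - ax, py - ay) for ax, ay in anchor)
               for px, py in traj) / len(traj)

def refine(predictions, anchors, gt_speed=None, keep=5):
    if gt_speed is not None:
        # Branch 1: rank by DTW similarity to the ground-truth speed profile.
        ranked = sorted(predictions,
                        key=lambda t: dtw(speed_profile(t), gt_speed))
        return ranked[:keep]
    selected = []
    for anchor in anchors:
        # Branch 2: the `keep` predictions nearest to this anchor path ...
        near = sorted(predictions,
                      key=lambda t: path_distance(anchor, t))[:keep]
        profiles = [speed_profile(t) for t in near]
        # ... then the one whose speed profile is closest to the group mean
        # (a single-cluster stand-in for K-means).
        centre = [sum(col) / len(col) for col in zip(*profiles)]
        best = min(range(len(near)),
                   key=lambda n: sum((s - c) ** 2
                                     for s, c in zip(profiles[n], centre)))
        selected.append(near[best])
    return selected

slow = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
fast = [(0.0, 0.0), (0.0, 3.0), (0.0, 6.0)]
side = [(0.0, 0.0), (2.0, 1.0), (4.0, 2.0)]
anchor = [(0.0, 0.0), (0.0, 3.0), (0.0, 6.0)]

with_gt = refine([slow, fast, side], [anchor], gt_speed=[3.0, 3.0], keep=1)
without_gt = refine([slow, fast, side], [anchor])
```

With the ground-truth speed available, the candidate with the matching speed profile is kept; without it, one candidate per anchor path is returned, which matches the five-anchor, five-output shape of the algorithm.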

IV EXPERIMENTS

IV-A Dataset & nuScenes Knowledge Graph

The nuScenes dataset [7] is a popular dataset for self-driving cars gathered in Boston and Singapore. It encompasses 1000 scenes, each lasting 20 seconds, and includes meticulously annotated ground truth details along with high-definition (HD) maps. The vehicles within this dataset have 3D bounding boxes manually annotated and published at a rate of 2 Hz. For the prediction task, the objective involves leveraging the preceding 2 seconds of object history and the map data to forecast the subsequent 6 seconds. We adhere to the standard split provided by the nuScenes benchmark description. Applying our proposed ontology to the nuScenes dataset, we generate the nuScenes Knowledge Graph including agent and map information as described in [6]. Features are provided by the upstream perception components and the HD map from the nuScenes dataset. Tables I and II list the feature sets used for each node type and each relation type. All features that express a category type are one-hot encoded.

TABLE I: Node Type Features

View	Node type	Features
Agent	SceneParticipant	Orientation, State, Position, Velocity, Acceleration, Heading Change, Distance to Centerline
Agent	Participant	Type, Size
	Sequence	Timestamp
	Scene	-
Map	LaneSnippet	Length
	LaneSlice	Width, Center Pose
	LaneConnector	-
	OrderedPose	Center Pose
	Lane	-
	CarparkArea	-
	Walkway	-
	Intersection	-
	PedCrossingStopArea	-
	StopSignArea	-
	TrafficLightStopArea	-
	TurnStopArea	-
	YieldStopArea	-

TABLE II: Relation Type Features

View	Relation type	Features
Agent	hasSceneParticipant	-
	inNextScene	Time Elapsed
	hasNextScene	Time Elapsed
	hasPreviousScene	Time Elapsed
	isSceneParticipant	-
Map	switchViaDoubleDashedWhite	-
	switchViaRoadDivider	-
	switchViaSingleZigzagWhite	-
	switchViaDoubleSolidWhite	-
	switchViaSingleSolidYellow	-
	switchViaSingleSolidWhite	-
	isSlice/PoseOnStopArea	-
	connectsIncoming/Outgoing	-
	hasNextLane/Snippet/Slice	-
Interaction	isOnMapElement	Probability
	relatedLongitudinal	Path/Distance
	relatedLateral	Path/Distance
	relatedIntersecting	Path/Distance
	relatedPedestrian	Distance

IV-B Metrics

We utilize standard evaluation metrics to assess the prediction performance, specifically employing $ADE_{K}$ (Average Displacement Error for $K$ modes) and $FDE_{K}$ (Final Displacement Error for $K$ modes). These metrics gauge $L_{2}$ errors, both at the final step and averaged across each step for predicting $K$ modes. The reported minimum error among the $K$ modes is considered. Both ADE and FDE are measured in meters. Additionally, the miss rate $MR_{K}$ calculates the percentage of scenarios where the final-step error exceeds 2 meters.
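The metrics above can be computed per sample as in the following sketch. It assumes that the minimum over the $K$ modes is taken separately for ADE and FDE, and that the miss rate is based on the best final-step error; the function names and the toy data are hypothetical.

```python
import math

def min_ade_fde(gt, modes):
    """minADE_K and minFDE_K for one sample: the smallest average and final
    L2 displacement over the K predicted trajectories."""
    ades, fdes = [], []
    for traj in modes:
        dists = [math.hypot(px - gx, py - gy)
                 for (px, py), (gx, gy) in zip(traj, gt)]
        ades.append(sum(dists) / len(dists))
        fdes.append(dists[-1])
    return min(ades), min(fdes)

def miss_rate(samples, threshold=2.0):
    """Fraction of samples whose best final-step error exceeds `threshold` m."""
    misses = sum(1 for gt, modes in samples
                 if min_ade_fde(gt, modes)[1] > threshold)
    return misses / len(samples)

gt = [(0.0, 1.0), (0.0, 2.0)]
modes_hit = [[(0.0, 1.0), (0.0, 2.0)], [(3.0, 1.0), (3.0, 2.0)]]
modes_miss = [[(3.0, 1.0), (3.0, 2.0)], [(0.0, 4.0), (0.0, 5.0)]]

ade_hit, fde_hit = min_ade_fde(gt, modes_hit)
mr = miss_rate([(gt, modes_hit), (gt, modes_miss)])
```

In the second toy sample every mode ends 3 m away from the ground truth, so it counts as a miss under the 2 m threshold.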

IV-C Model Implementation

The hidden dimension of vectors in the pipeline is set to 32. The number of layers of the heterogeneous graph neural network is set to 1, and sum is used as the aggregation method. The number of attention heads in HAN is set to 8, whereas the parameters $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ of equation 13 are set to 0.95, 1, and 1, respectively.

We use all agent and map elements within the four closest roadblocks. The coordinate system in the model is the BEV centered at the agent location at $t=0$. We use the orientation from the agent location at $t=-1$ to the agent location at $t=0$ as the positive x-axis. The model is trained on a Tesla V100 GPU with a batch size of 32 and the Adam optimizer, with an initial learning rate of $1\times 10^{-3}$ that is decayed by a factor of 0.7 every 5 epochs.
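The learning-rate schedule stated above corresponds to a simple step decay. The sketch below assumes that epochs are counted from 0 and that the decay is applied at epochs 5, 10, 15, and so on; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, gamma=0.7, step=5):
    """Step decay: multiply the rate by `gamma` once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

lr_start = learning_rate(0)    # initial rate
lr_epoch5 = learning_rate(5)   # after the first decay
lr_epoch12 = learning_rate(12) # after the second decay
```

Under these assumptions the rate stays at the initial value for the first five epochs and is then scaled by 0.7 and 0.49 for the following two five-epoch blocks.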

IV-D Quantitative Results

We compare our results on the nuScenes online benchmark as shown in Table III. The SemanticFormer method directly predicts 5 trajectories without prediction refinement, whereas its extension, SemanticFormerR, predicts 25 trajectories and then refines those predictions. As can be observed, SemanticFormerR achieves competitive performance, indicating the benefit of leveraging complex and heterogeneous scene information represented in the Knowledge Graph. The results also suggest that speed profiles have a large impact on future trajectories: in an unfair comparison that uses the ground truth speed in Algorithm 2, SemanticFormerR clearly outperforms state-of-the-art methods.

TABLE III: Performance Table on nuScenes Benchmark

Method	GT Speed	K=1 FDE	K=5 ADE	K=5 MR	K=10 ADE	K=10 MR
CoverNet [3]	$\times$	11.36	1.96	0.67	1.48	-
Trajectron++ [49]	$\times$	9.52	1.88	0.70	1.51	0.57
LaPred [64]	$\times$	8.37	1.47	0.53	1.12	0.46
P2T [56]	$\times$	10.50	1.45	0.64	1.16	0.46
LaneGCN [13]	$\times$	-	-	0.49	0.95	0.36
GOHOME [65]	$\times$	6.99	1.42	0.57	1.15	0.47
Autobot [38]	$\times$	8.19	1.37	0.62	1.03	0.44
THOMAS [12]	$\times$	6.71	1.33	0.55	1.04	-
PGP [66]	$\times$	7.17	1.30	0.61	1.00	0.37
LaFormer [8]	$\times$	6.95	1.19	0.48	0.93	0.33
Socialea [61]	$\times$	6.77	1.18	0.48	1.02	0.44
FRM [58]	$\times$	6.59	1.18	0.48	0.88	0.30
SemanticFormer	$\times$	6.29	1.15	0.48	0.91	0.31
SemanticFormerR	$\times$	6.27	1.14	0.50	0.87	0.30
DMAP [59]	$\checkmark$	-	1.09	0.19	1.07	0.18
SemanticFormerR	$\checkmark$	3.88	0.86	0.26	0.78	0.13

IV-E Ablation study

IV-E1 Effect of Topological Structure of Heterogeneous Graph

The knowledge graph provides explicit and logical relationships between different heterogeneous nodes. We study the performance improvement compared to a fully connected and a fully unconnected graph structure, as shown in Table IV.

TABLE IV: Ablation Study of Graph Topological Structure

Graph Topology	Edge Types	K=5 ADE	K=5 FDE
Knowledge Graph	46	1.15	2.20
Fully Connected Graph	1	1.19	2.31
Fully Unconnected Graph	0	1.24	2.46

IV-E2 Effect of Individual Components

Our proposed heterogeneous graph is mainly composed of four parts: map topology, meta-paths, agent-map relationships, and agent-agent relationships. We investigate the impact of dropping certain inputs to the model, as shown in Table V.

TABLE V: Ablation Study for Graph Components

Meta-Paths	Map-Topology	Agent-Map	Agent-Agent	K=5 ADE	K=5 FDE
$\checkmark$	$\checkmark$	$\checkmark$	$\checkmark$	1.15	2.20
$\times$	$\checkmark$	$\checkmark$	$\checkmark$	1.18	2.29
$\checkmark$	$\times$	$\checkmark$	$\checkmark$	1.17	2.26
$\times$	$\times$	$\checkmark$	$\checkmark$	1.22	2.39
$\times$	$\times$	$\times$	$\checkmark$	1.23	2.42
$\times$	$\times$	$\times$	$\times$	1.24	2.46

IV-E3 Integration to other Models

We integrate our proposed Knowledge Graph into other graph-based models like VectorNet and LaFormer. Table VI shows the experimental results indicating that the Knowledge Graph can effectively improve the performance of the chosen methods.

TABLE VI: Ablation Study for Integrating other Architectures

Architectures	ADE_5	FDE_1	OffRoadRate
VectorNet [5]	1.34	7.98	0.04
VectorNet + KG	1.26	7.55	0.03
LaFormer [8]	1.19	6.95	0.02
LaFormer + KG	1.15	6.29	0.02

IV-E4 Effect of Heterogeneous Graph Operators

We analyze different heterogeneous graph operators like HGT [67] and HAN [63], with results shown in Table VII. To prevent overfitting, we merge sub-classes such as single solid, double solid, etc. into switchViaPermitted and switchViaNonPermitted relationships. In the table, *N means that the operator has N layers.

TABLE VII: Ablation Study for HGNN Operators

Interaction Graph	Operators	Self Loop	Meta Path	K=5 ADE	K=5 FDE
Original	$\mathrm{HGT}^{*}2$	$\checkmark$	$\times$	1.24	2.49
Compact	$\mathrm{HGT}^{*}2$	$\checkmark$	$\times$	1.24	2.46
Compact	$\mathrm{HGT}^{*}2$	$\times$	$\times$	1.22	2.38
Compact	$\mathrm{HAN}^{*}2$	$\times$	$\checkmark$	1.19	2.34
Compact	$\mathrm{HAN}^{*}1$	$\times$	$\checkmark$	1.15	2.20
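
The merging of edge sub-classes into the Compact interaction graph can be sketched as a simple relabelling of edge types. The fine-grained relation names below and their assignment to the permitted or non-permitted group are assumptions made for illustration; only the two target names, switchViaPermitted and switchViaNonPermitted, come from the text.

```python
from collections import defaultdict

# ASSUMPTION: these fine-grained relation names and their grouping are
# illustrative; the actual ontology and lane-divider semantics may differ.
MERGE_MAP = {
    "switchViaSingleDashed": "switchViaPermitted",
    "switchViaDoubleDashed": "switchViaPermitted",
    "switchViaSingleSolid": "switchViaNonPermitted",
    "switchViaDoubleSolid": "switchViaNonPermitted",
}


def compact_edges(edges_by_type):
    """Merge fine-grained edge types into coarse ones; unmapped types pass through."""
    merged = defaultdict(list)
    for edge_type, edges in edges_by_type.items():
        merged[MERGE_MAP.get(edge_type, edge_type)].extend(edges)
    return dict(merged)


# Hypothetical lane-to-lane edges keyed by fine-grained relation.
original = {
    "switchViaSingleDashed": [("lane_0", "lane_1")],
    "switchViaDoubleSolid": [("lane_1", "lane_2")],
    "switchViaSingleSolid": [("lane_2", "lane_3")],
    "hasNextLane": [("lane_0", "lane_4")],
}
compact = compact_edges(original)
print(sorted(compact))  # ['hasNextLane', 'switchViaNonPermitted', 'switchViaPermitted']
```

Fewer edge types mean fewer type-specific parameters in a heterogeneous operator, which is one plausible reason why such merging can reduce overfitting.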

IV-F Qualitative results

A qualitative visualization of our predictions is depicted in Figure 7. Green trajectories are the ground truth and red trajectories are the five predictions. Row 1 shows predictions that cover all possible driving paths, and row 2 shows a lane-changing situation that is captured successfully.

V CONCLUSIONS

This paper proposes a novel approach that uses a traffic scene knowledge graph, built from past trajectories and an HD map, to predict a set of multimodal trajectories. A scene graph encoder module captures the interactions in a traffic scene from four aspects: agent-agent interaction, agent-map interaction, map-map interaction, and meta-path interaction. Further, the refinement module considers typical speed profiles and anchor paths to refine trajectory candidates. Our approach achieves improved results compared to several state-of-the-art models on the nuScenes benchmark. We also provide an experimental justification of our approach by replacing the original homogeneous graphs of two SOTA methods, VectorNet and LaFormer, with our Knowledge Graph. We show that the Knowledge Graph improves the performance of those methods by 5% and 4%, respectively. Moreover, extensive ablation and sensitivity studies indicate that our proposed Knowledge Graph can be easily integrated into other graph-based methods to improve performance. Future work will focus on extending the Knowledge Graph with additional information such as traffic rules, traffic signs, and forms of common-sense driving knowledge.

References

[1] C. Ju, Z. Wang, C. Long, X. Zhang, and D. E. Chang, “Interaction-aware kalman neural networks for trajectory prediction,” in 2020 IEEE Intelligent Vehicles Symposium (IV), 2020, pp. 1793–1800.
[2] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, et al., “Multimodal trajectory predictions for autonomous driving using deep convolutional networks,” ICRA, pp. 2090–2096, 2019.
[3] T. Phan-Minh, E. C. Grigore, F. A. Boulton, et al., “CoverNet: Multimodal behavior prediction using trajectory sets,” IEEE/CVF CVPR, 2019.
[4] J. Li, F. Yang, M. Tomizuka, et al., “EvolveGraph: Multi-agent trajectory prediction with dynamic relational reasoning,” NeurIPS, 2020.
[5] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid, “VectorNet: Encoding hd maps and agent dynamics from vectorized representation,” 2020 IEEE/CVF CVPR, pp. 11 522–11 530, 2020.
[6] L. Mlodzian, Z. Sun, H. Berkemeyer, S. Monka, Z. Wang, S. Dietze, L. Halilaj, and J. Luettin, “nuScenes knowledge graph - A comprehensive semantic representation of traffic scenes for trajectory prediction,” in IEEE/CVF ICCV 2023 - Workshops, 2023, pp. 42–52.
[7] H. Caesar, V. Bankiti, A. H. Lang, et al., “nuScenes: A multimodal dataset for autonomous driving,” in IEEE/CVF CVPR, 2020.
[8] M. Liu, H. Cheng, L. Chen, H. Broszio, J. Li, R. Zhao, M. Sester, and M. Y. Yang, “LAformer: Trajectory prediction for autonomous driving with lane-aware scene constraints,” in IEEE/CVF CVPR, 2024.
[9] N. Djuric, V. Radosavljevic, H. Cui, T. Nguyen, F.-C. Chou, T.-H. Lin, et al., “Uncertainty-aware short-term motion prediction of traffic actors for autonomous driving,” 2020 IEEE WACV, pp. 2084–2093, 2018.
[10] J. Hong, B. Sapp, and J. Philbin, “Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions,” 2019 IEEE/CVF CVPR, pp. 8446–8454, 2019.
[11] T. Gilles, S. Sabatini, D. Tsishkou, B. Stanciulescu, and F. Moutarde, “HOME: Heatmap output for future motion estimation,” in ITSC, 2021.
[12] ——, “THOMAS: trajectory heatmap output with learned multi-agent sampling,” ICLR, 2022.
[13] M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, et al., “Learning lane graph representations for motion forecasting,” ECCV, 2020.
[14] S. Casas, C. Gulino, R. Liao, and R. Urtasun, “SpAGNN: Spatially-aware graph neural networks for relational behavior forecasting from sensor data,” ICRA, pp. 9491–9497, 2019.
[15] B. Varadarajan, A. S. Hefny, A. Srivastava, K. S. Refaat, et al., “Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction,” ICRA, pp. 7814–7821, 2021.
[16] S. Konev, “MPA: Multipath++ based architecture for motion prediction,” IEEE/CVF CVPR Workshop on Autonomous Driving, 2022.
[17] H. Zhao, J. Gao, T. Lan, et al., “TNT: Target-driven trajectory prediction,” in Conference on Robot Learning, 2020.
[18] X. Mo, Z. Huang, Y. Xing, and C. Lv, “Multi-agent trajectory prediction with heterogeneous edge-enhanced graph attention network,” IEEE Transactions on ITS, vol. 23, pp. 9554–9567, 2022.
[19] X. Jia, P. Wu, L. Chen, Y. Liu, H. Li, and J. Yan, “HDGT: heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding,” IEEE Trans. PAMI, vol. 45, no. 11, pp. 13 860–13 875, 2023.
[20] T. Monninger, J. Schmidt, J. Rupprecht, D. Raba, et al., “SCENE: Reasoning about traffic scenes using heterogeneous graph neural networks,” IEEE Robotics and Automation Letters, vol. 8, no. 3, 2023.
[21] S. Wonsak, M. Al-Rifai, M. Nolting, and W. Nejdl, “Multi-modal motion prediction with graphormers,” in ITSC. IEEE, 2022.
[22] D. Grimm, M. Zipfl, F. Hertlein, A. Naumann, J. Luettin, S. Thoma, S. Schmid, L. Halilaj, A. Rettinger, and J. M. Zöllner, “Heterogeneous graph-based trajectory prediction using local map context and social interactions,” IEEE ITSC, pp. 2901–2907, 2023.
[23] Z. Wang, Z. Sun, J. Luettin, and L. Halilaj, “SocialFormer: Social interaction modeling with edge-enhanced heterogeneous graph transformers for trajectory prediction,” 2024. [Online]. Available: https://arxiv.org/abs/2405.03809
[24] L. Halilaj, J. Luettin, C. A. Henson, and S. Monka, “Knowledge graphs for automated driving,” IEEE AIKE, pp. 98–105, 2022.
[25] L. Halilaj, J. Luettin, S. Monka, C. A. Henson, and S. Schmid, “Knowledge graph-based integration of autonomous driving datasets,” Int. J. Semantic Comput., vol. 17, pp. 249–271, 2023.
[26] J. Luettin, S. Monka, C. A. Henson, and L. Halilaj, “A survey on knowledge graph-based methods for automated driving,” in Knowledge Graphs and Semantic Web, KGSWC. Springer, 2022, pp. 16–31.
[27] S. Xiong, Y. Yang, F. Fekri, and J. C. Kerce, “TILP: Differentiable learning of temporal logical rules on knowledge graphs,” arXiv preprint arXiv:2402.12309, 2024.
[28] S. Xiong, A. Payani, R. Kompella, and F. Fekri, “Large language models can learn temporal reasoning,” arXiv preprint arXiv:2401.06853, 2024.
[29] L. Halilaj, J. Luettin, S. Rothermel, S. K. Arumugam, and I. Dindorkar, “Towards a knowledge graph-based approach for context-aware points-of-interest recommendations,” ACM SAC, 2021.
[30] S. Werner, A. Rettinger, L. Halilaj, and J. Luettin, “RETRA: Recurrent transformers for learning temporally contextualized knowledge graph embeddings,” in Extended Semantic Web Conference, 2020.
[31] L. Halilaj, I. Dindorkar, J. Luettin, and S. Rothermel, “A knowledge graph-based approach for situation comprehension in driving scenarios,” in Extended Semantic Web Conference, 2021.
[32] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, “MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,” in Conference on Robot Learning, 2019.
[33] S. Casas, W. Luo, and R. Urtasun, “IntentNet: Learning to predict intention from raw sensor data,” in Conference on Robot Learning, 2018.
[34] Y. Tang and R. Salakhutdinov, “Multiple futures prediction,” in Neural Information Processing Systems, 2019.
[35] K. Messaoud, I. Yahiaoui, A. Verroust-Blondet, and F. Nashashibi, “Attention based vehicle trajectory prediction,” IEEE Trans. Intell. Veh., vol. 6, no. 1, pp. 175–185, 2021.
[36] S. Park, G. Lee, M. Bhat, et al., “Diverse and admissible trajectory forecasting through multimodal context understanding,” in ECCV, 2020.
[37] Y. Yuan, X. Weng, Y. Ou, and K. Kitani, “AgentFormer: Agent-aware transformers for socio-temporal multi-agent forecasting,” IEEE/CVF ICCV, pp. 9793–9803, 2021.
[38] R. Girgis, F. Golemo, F. Codevilla, et al., “Latent variable sequential set transformers for joint multi-agent motion prediction,” ICLR, 2021.
[39] S. Khandelwal, W. Qi, J. Singh, et al., “What-if motion prediction for autonomous driving,” in IEEE/RSJ IROS, 2022.
[40] Y. Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou, “Multimodal motion prediction with stacked transformers,” in 2021 IEEE/CVF CVPR, 2021.
[41] Z. Huang, X. Mo, and C. Lv, “Multi-modal motion prediction with transformer-based neural network for autonomous driving,” ICRA, 2021.
[42] J. Ngiam, V. Vasudevan, B. Caine, Z. Zhang, et al., “Scene Transformer: A unified architecture for predicting future trajectories of multiple agents,” in ICLR, 2022.
[43] N. Nayakanti, R. Al-Rfou, A. Zhou, et al., “Wayformer: Motion forecasting via simple & efficient attention networks,” ICRA, 2022.
[44] S. Shi, L. Jiang, D. Dai, and B. Schiele, “Motion transformer with global intention localization and local movement refinement,” in NeurIPS, 2022.
[45] J. P. Mercat, T. Gilles, N. E. Zoghby, et al., “Multi-head attention for multi-modal joint vehicle motion forecasting,” ICRA, 2019.
[46] Z. Zhou, L. Ye, J. Wang, K. Wu, and K. Lu, “HiVT: Hierarchical vector transformer for multi-agent motion prediction,” CVPR, 2022.
[47] N. Rhinehart, R. T. McAllister, K. Kitani, and S. Levine, “PRECOG: Prediction conditioned on goals in visual multi-agent settings,” IEEE/CVF ICCV, pp. 2821–2830, 2019.
[48] E. Amirloo, A. Rasouli, P. Lakner, M. Rohani, and J. Luo, “LatentFormer: Multi-agent transformer-based interaction modeling and trajectory prediction,” ArXiv, vol. abs/2203.01880, 2022.
[49] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone, “Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data,” in ECCV, 2020.
[50] A. Seff, B. Cera, D. Chen, et al., “MotionLM: Multi-agent motion forecasting as language modeling,” ICCV, 2023.
[51] A. Keysan, A. Look, E. Kosman, G. Gürsun, J. Wagner, Y. Yao, and B. Rakitsch, “Can you text what is happening? integrating pre-trained language encoders into trajectory prediction models for autonomous driving,” ArXiv, vol. abs/2309.05282, 2023.
[52] Z. Huang, H. Liu, and C. Lv, “GameFormer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving,” IEEE/CVF ICCV, pp. 3880–3890, 2023.
[53] S. Casas, C. Gulino, S. Suo, K. Luo, et al., “Implicit latent variable model for scene-consistent motion forecasting,” in ECCV, 2020.
[54] S. V. Albrecht, C. Brewitt, J. Wilhelm, et al., “Interpretable goal-based prediction and planning for autonomous driving,” ICRA, 2020.
[55] J. Gu, C. Sun, and H. Zhao, “DenseTNT: End-to-end trajectory prediction from dense goal sets,” IEEE/CVF ICCV, 2021.
[56] N. Deo and M. M. Trivedi, “Trajectory forecasts in unknown environments conditioned on grid-based plans,” ArXiv, vol. abs/2001.00735, 2020.
[57] Q. Lu, W. Han, J. Ling, et al., “KEMP: Keyframe-based hierarchical end-to-end deep model for long-term trajectory prediction,” ICRA, 2022.
[58] D.-H. Park, H. Ryu, Y. Yang, J. Cho, et al., “Leveraging future relationship reasoning for vehicle trajectory prediction,” ICLR, 2023.
[59] A. Naumann, F. Hertlein, D. Grimm, M. Zipfl, S. Thoma, A. Rettinger, L. Halilaj, J. Luettin, S. Schmid, and H. Caesar, “Lanelet2 for nuscenes: Enabling spatial semantic relationships and diverse map-based anchor paths,” in IEEE/CVF CVPR, 2023, pp. 3247–3256.
[60] Z. Zhou, J. Wang, Y.-H. Li, and Y.-K. Huang, “Query-centric trajectory prediction,” 2023 IEEE/CVF CVPR, pp. 17 863–17 873, 2023.
[61] J. Chen, Z. Wang, J. Wang, and B. Cai, “Q-EANet: Implicit social modeling for trajectory prediction via experience-anchored queries,” IET Intelligent Transport Systems, 2023.
[62] M. Zipfl, F. Hertlein, A. Rettinger, S. Thoma, L. Halilaj, J. Luettin, S. Schmid, and C. A. Henson, “Relation-based motion prediction using traffic scene graphs,” IEEE ITSC, pp. 825–831, 2022.
[63] X. Wang, H. Ji, C. Shi, B. Wang, et al., “Heterogeneous graph attention network,” in The world wide web conference, 2019.
[64] B. Kim, S. H. Park, S. Lee, et al., “LaPred: Lane-aware prediction of multi-modal future trajectories of dynamic agents,” in IEEE/CVF CVPR, 2021.
[65] T. Gilles, S. Sabatini, D. V. Tsishkou, et al., “GOHOME: Graph-oriented heatmap output for future motion estimation,” ICRA, 2021.
[66] N. Deo, E. M. Wolff, and O. Beijbom, “Multimodal trajectory prediction conditioned on lane-graph traversals,” in CoRL, 2021.
[67] Z. Hu, Y. Dong, K. Wang, and Y. Sun, “Heterogeneous graph transformer,” in Proceedings of the web conference, 2020, pp. 2704–2710.