Diving Deeper Into Pedestrian Behavior Understanding: Intention Estimation, Action Prediction, and Event Risk Assessment

Amir Rasouli¹ and Iuliia Kotseruba². ¹Huawei Technologies Canada, [email protected]. ²York University, [email protected]. Work was done while at Huawei Technologies Canada.

Abstract

In this paper, we delve into the pedestrian behavior understanding problem from the perspective of three different tasks: intention estimation, action prediction, and event risk assessment. We first define the tasks and discuss how these tasks are represented and annotated in two widely used pedestrian datasets, JAAD and PIE. We then propose a new benchmark based on these definitions, available annotations, and three new classes of metrics, each designed to assess different aspects of the model performance.

We apply the new evaluation approach to examine four SOTA prediction models on each task and compare their performance w.r.t. metrics and input modalities. In particular, we analyze the differences between intention estimation and action prediction tasks by considering various scenarios and contextual factors. Lastly, we examine model agreement across these two tasks to show their complementary role. The proposed benchmark reveals new facts about the role of different data modalities, the tasks, and relevant data properties. We conclude by elaborating on our findings and proposing future research directions. Code is available at github.com/aras62/PIE/tree/master/scenarioEval.

I Introduction

Safety is the primary concern for predicting pedestrian behavior in traffic. The problem can be formulated as determining whether the pedestrian’s action will lead them to appear in the path of the vehicle. There is a growing number of solutions to this problem that aim to anticipate pedestrians’ actions (e.g., crossing the road) from monocular videos and vehicle sensors. Although benchmark datasets established for this task continue to register robust performance improvements, there remain some outstanding issues.

One of the ongoing concerns is the conflation of intention and action prediction tasks in the literature. Particularly, after the introduction of datasets that provide data for both tasks (e.g., [1, 2]), it has become difficult to discern models trained for intention estimation and action prediction as the terms are often used interchangeably. Additionally, these tasks only indicate potential risk, but on their own are not sufficient to measure the direct impact of the predicted events on the behavior of the intelligent vehicle.

Another issue is the narrow focus of evaluation procedures that measure performance by averaging accuracy of models over all observations. For safety purposes, a deeper understanding of the algorithm performance is needed, particularly, because most models are difficult to interpret. For example, it is important to assess how early the models can forecast future actions and how consistent the predictions remain as the vehicle approaches the pedestrian.

The contributions of this paper are summarized as follows: 1) We provide a formal definition for intention and action prediction tasks; 2) We introduce an event risk assessment task designed to measure the impact of the predicted action on the ego-vehicle (see Figure 1); 3) We propose new metrics that focus on measuring how timely, balanced, and consistent model predictions are; 4) We evaluate state-of-the-art (SOTA) models on all three tasks, with particular focus on highlighting the differences between intention estimation and action prediction, identifying what factors impact each task, and assessing model agreement on both.

Figure 1: An overview of different tasks of pedestrian behavior understanding. Top: connections between different tasks, definite (solid arrows) and probable (dashed arrows). Bottom: examples of pedestrians with different types of behavior and associated risks.

II Related Work

II-A Task definitions

We define the following three tasks related to understanding pedestrian behavior in traffic: intention estimation, action prediction, and event risk assessment. They follow in this order: first, the pedestrian decides to cross (intention), then begins crossing the road if circumstances permit (action), which may or may not put them in the way of the ego-vehicle (risk), as shown in Figure 1.

Intention vs. action. The difference between forming a goal and acting on it was already established in the 1890s [3] and became a part of several theories of human behavior [4, 5], as well as more recent implementations [6, 7, 1]. Following these works, we consider crossing intention a precursor of action. Intention is a state of mind, so it cannot be observed directly, but may lead to action under certain conditions. In contrast, crossing action is an observable event of the pedestrian crossing the road in front of the ego-vehicle.

In the literature, “intent(ion) prediction” [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] and “action prediction” [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37] occur with almost equal frequency but, with few exceptions [11, 35, 36], both terms mean predicting pedestrian actions. Here, we refer to the tasks as intention estimation since intention exists in the present, and action prediction, as it is concerned with future events. In intelligent driving, intention (e.g. in the form of the goal or target of the agent [38, 39]) is often used in conjunction with action prediction for improved accuracy. Here, we evaluate these tasks separately to highlight their differences and identify the challenges pertaining to each task.

Event risk assessment. Assessing the risk posed by other vehicles or pedestrians is crucial for intelligent driving systems [40, 41, 42]. Pedestrian intention and upcoming action can indicate the possibility of risky events, but on their own they are not sufficient to measure the impact of those events on the intelligent vehicle. Trajectory prediction models directly estimate future locations of pedestrians, as a sequence of coordinates [43, 1] and/or a final destination [37, 29], but further interpretation is needed to determine their potential risk. For example, one can estimate whether the pedestrian will end up in the driver’s comfort zone [42]. Here, we extend this idea to the egocentric setting and propose to directly assess the future risk of pedestrian action with respect to the ego-vehicle based on the risk regions in the image plane that are aligned with the center of the ego-vehicle.

II-B Model evaluation

A number of datasets for studying and modeling pedestrian behavior have been proposed [23, 1, 44, 10, 45, 2, 46], out of which JAAD [23] and PIE [1] are currently the most used. The majority of models trained and evaluated on these datasets [21, 22, 19, 18] follow the protocol in [47].

JAAD and PIE provide multi-modal data consisting of monocular video footage filmed from inside the moving vehicle and annotations: spatial (bounding boxes for pedestrians and relevant objects, pedestrian poses), textual (labels describing properties of the scene, pedestrian behaviors and attributes, and drivers’ actions), and numeric (vehicle telemetry). Existing models rely on a variety of input modalities, including visual context [23, 8, 9, 26, 10], pedestrian poses [30], or bounding boxes [32], or a combination of these modalities for more robust performance [33, 19, 47]. In this work, we evaluate models with different input modalities to highlight performance differences on the proposed tasks.

Past evaluation approaches relied on a subset of classification metrics, such as accuracy, recall, precision, AUC, and F1-score. In addition, results of individual models were related to various aspects of the data, such as time-to-event [24, 47], observation length [24, 34], prediction horizon [10, 32], input features [24, 17], scale and occlusion of bounding boxes [47], and ego-vehicle speed [22]. However, in all cases, metrics are averaged over all samples across different time horizons and pedestrian instances. Such evaluation assesses the overall performance, but fails to address consistency of the models and their limitations in predicting different horizons or risk levels. Here, we propose several additional metrics to capture the latter aspects of the models.

III Experiment Setup

We evaluate four SOTA action prediction models, SFGRU [24], BiPed [29], PCPA [47], and PedFormer [37], on two public benchmarks – Pedestrian Intention Estimation (PIE) dataset [1] and JAAD [23]. Below, we discuss data processing, model properties, and metrics definitions.

III-A Data

Action and intention annotations. Both datasets contain annotated videos of traffic scenes. Two types of annotations are the most relevant here: pedestrian intentions (which reflect their motivation to cross) and crossing actions (that specify whether they will cross in front of the ego-vehicle).

Because intentions are not directly observable, in PIE, intention labels were aggregated from the responses of human subjects who viewed the clips from the dataset and indicated whether pedestrians in them intended (or wanted) to cross the street (not necessarily in front of the ego-vehicle). These scores were averaged and used as intention labels. Note that these intention labels are not ground truth per se, but rather a probabilistic estimation of pedestrians’ intentions.

Intent labels in JAAD are binary (not probabilistic) and are assigned as follows: non-crossing intent is assigned to all bystander pedestrians, i.e. those that do not interact with the ego-vehicle or are deemed irrelevant by the annotators, and the rest are considered as having a crossing intent. Due to the obvious biases in the labeling process and lack of experimental validation (as in PIE), these labels do not effectively reflect intentions of pedestrians and therefore will not be used for the experiments.

Crossing action labels in both datasets simply state whether the pedestrian was observed crossing the road in front of the ego-vehicle. The intention annotations, however, differ between the two datasets, as described above.

Data split. For all tasks, we set the observation length to $0.5s$ ( $15$ frames) and extract samples with a $30\%$ overlap to get a more uniform distribution. Note that for the remainder of the paper, instance refers to the entire track of the individual pedestrian and sample refers to a portion of this track, comprised of observation and prediction.

III-A1 Intention estimation

As mentioned earlier, in PIE, intention estimation labels are collected from human subjects who viewed short clips of pedestrians extracted from the dataset. The start and end points of these clips are specified in the annotations with exp_start_point and critical_point tags for each pedestrian instance (see Figure 2). We use these points to sample data and discard instances shorter than observation length.

Given the probabilistic nature of intention labels ( $\mathrm{intention}\in[0,1]$ ), we divide them into three equal intervals for no-crossing intention (NCI), unsure intention (UI) and crossing intention (CI), respectively (see Table I).
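The three-way split can be sketched as follows. This is a minimal illustration rather than the benchmark code: the function name `intention_class` is hypothetical, and assigning the boundary values 1/3 and 2/3 to the upper interval is an assumption, since the text only states that the three intervals are equal.

```python
def intention_class(score):
    """Map an aggregated intention score in [0, 1] to NCI, UI, or CI."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("intention score must lie in [0, 1]")
    # Assumption: boundary scores (exactly 1/3 or 2/3) go to the upper interval.
    if score < 1 / 3:
        return "NCI"  # no-crossing intention
    if score < 2 / 3:
        return "UI"   # unsure intention
    return "CI"       # crossing intention


labels = [intention_class(s) for s in (0.1, 0.5, 0.9)]
```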

TABLE I: Number of samples for intention and action tasks.

Task	Dataset	Class	Train	Test	Val	# Ped
Intention	PIE	NCI	1922	756	449	329
Intention	PIE	UI	1009	835	312	220
Intention	PIE	CI	5282	4881	1538	1285
Intention	PIE	Total	8213	6472	2299	1834
Action	PIE	NC	4163	3203	1218	1282
Action	PIE	C	1417	1255	327	499
Action	PIE	Total	5580	4458	1545	1781
Action	JAAD	NC	4456	3548	702	1519
Action	JAAD	C	1122	769	107	353
Action	JAAD	Total	5578	4317	809	1872

III-A2 Action prediction

Crossing events in both PIE and JAAD datasets are labeled as crossing_point. This tag indicates the frame when the pedestrian started crossing or, if they did not cross in front of the ego-vehicle, the last frame in which the pedestrian was visible. We extract samples with time-to-event (TTE) between $1$ and $3s$ ($30$ to $90$ frames) (see Figure 2), excluding only samples with TTE below $1s$ and instances shorter than the observation length (see Table I).
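A sliding-window extraction consistent with this description might look as follows. This is a sketch under stated assumptions, not the released implementation: the function name `extract_action_windows` and its parameters are hypothetical, TTE is measured from the end of the observation window to the event frame, and the stride is the non-overlapping part of the window truncated to an integer (10 frames for a 15-frame window, which gives roughly the stated $30\%$ overlap).

```python
def extract_action_windows(event_frame, obs_len=15, overlap=0.3,
                           tte_min=30, tte_max=90, track_start=0):
    """Return (obs_start, obs_end, tte) tuples for one pedestrian track.

    obs_end is exclusive, and tte = event_frame - obs_end, in frames.
    """
    # Assumed stride: the non-overlapping share of the window, truncated.
    stride = max(1, int(obs_len * (1 - overlap)))
    # Earliest window ends tte_max frames before the event (or starts at track start).
    first_start = max(track_start, event_frame - obs_len - tte_max)
    # Latest window ends tte_min frames before the event.
    last_start = event_frame - obs_len - tte_min
    windows = []
    for start in range(first_start, last_start + 1, stride):
        end = start + obs_len
        windows.append((start, end, event_frame - end))
    return windows


windows = extract_action_windows(event_frame=120)
```

Tracks that are too short to fit an observation window at least `tte_min` frames before the event yield an empty list, which mirrors the exclusion of short instances.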

III-A3 Event risk assessment

We divide the image plane into equal vertical regions $160$ pixels wide (double the average width of the pedestrians’ bounding boxes). As a result, there are a total of $12$ regions representing 6 classes of risk (due to symmetry, as shown in Figure 3). The prediction horizon is set to $3s$ , double the average reaction time to surprise events [48]. The risk class is assigned based on the center coordinate of the bounding box at the end of the prediction horizon, calculated from the last observation frame. If for a given pedestrian the bounding box coordinate at prediction time does not exist (i.e. the pedestrian is not visible anymore), the last available bounding box is selected.
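The region assignment can be illustrated with a short sketch. It assumes a 1920-pixel-wide frame (consistent with $12$ regions of $160$ pixels), 1-based region indices counted from left to right, and a hypothetical numbering of the 6 symmetric classes in which 0 denotes the central pair of regions. The function names are illustrative only.

```python
def risk_region(x_center, img_width=1920, region_width=160):
    """Map the bounding-box center x (pixels) to a 1-based region index."""
    n_regions = img_width // region_width
    idx = int(x_center // region_width)
    # Clamp so that centers on or beyond the image border fall into edge regions.
    idx = min(max(idx, 0), n_regions - 1)
    return idx + 1


def risk_class(region, n_regions=12):
    """Fold mirrored regions together (even n_regions assumed).

    Returns 0 for the two central regions and larger values toward the edges.
    """
    half = n_regions // 2
    return half - region if region <= half else region - half - 1


classes = sorted({risk_class(r) for r in range(1, 13)})
```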

III-B Models

We select models with different architectures and input modalities to highlight the differences between different design choices. BiPed [29] and PedFormer [37] are multitask models that simultaneously predict trajectories and actions of pedestrians. Although architecturally different, both models rely mainly on observed trajectories, ego-vehicle sensors, and some visual context (in the form of semantic maps for interaction modeling). The other two models, SFGRU [24] and PCPA [47], are single-task, i.e., predict only the probability of crossing. Besides ego-dynamics and pedestrian trajectories, they rely on visual information (actual images of pedestrians and their surroundings) and pedestrian poses.

To adapt these models for intention estimation and event risk assessment tasks, we modify their objective functions, while using default parameters. For BiPed and PedFormer, we change only the crossing action task and keep the auxiliary trajectory and grid prediction tasks the same.

III-C Base metrics

Following [47], we report the results using common classification metrics: accuracy (Acc), Area Under the Curve (AUC), F1, and precision (Prec). To mitigate effects of class imbalances in the datasets, we also compute balanced accuracy (bAcc) and mean average precision (mAP).
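To illustrate why balanced accuracy is reported alongside accuracy, the sketch below uses the standard definition of balanced accuracy as the mean of per-class recalls. That this matches the exact computation used in the benchmark is an assumption, and the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, computed over classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# A model that always predicts the majority class (0 = non-crossing).
y_true = [0] * 9 + [1]
y_pred = [0] * 10
acc, bacc = accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred)
```

In this toy case the majority-class predictor scores high on accuracy but only at chance level on balanced accuracy, which is the effect the additional metric is meant to expose.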

III-D Weighted metrics

III-D1 Action prediction

In general, prediction is easier closer to the event, as more contextual cues become available. However, in safety-critical applications like driving, it is vital to make accurate predictions as early as possible. We therefore propose a per-sample weighted average of metrics based on the TTE of the samples, i.e., the closer a sample is to the event, the lower its weight. We express the weights using an exponential function as follows:

	$\displaystyle\omega_{a}=\frac{\exp\left(-\frac{1}{2}\left(\frac{d_{tte}}{\sigma}\right)^{2}\right)}{\sum_{TTE}\omega},\quad d_{tte}=\frac{\max(TTE)-tte_{a}}{\max(TTE)},$

where $\omega_{a}$ and $tte_{a}$ are the weight and TTE of the sample and $\max(TTE)$ is the maximum of TTEs across all samples, in this case $3s$ . We set $\sigma=0.3$ empirically.
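The weighting can be sketched in a few lines. This is an illustration of the formula above, not the released code: the function names are hypothetical, and $\max(TTE)$ is taken from the list passed in unless given explicitly, which is an assumption.

```python
import math


def tte_weights(ttes, sigma=0.3, max_tte=None):
    """Normalized per-sample weights; samples farther from the event weigh more."""
    if max_tte is None:
        max_tte = max(ttes)  # assumption: maximum over the samples provided
    raw = [math.exp(-0.5 * (((max_tte - t) / max_tte) / sigma) ** 2) for t in ttes]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_accuracy(y_true, y_pred, weights):
    """Sum of the (normalized) weights of correctly classified samples."""
    return sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)


# TTEs in seconds; only the ratio to max(TTE) matters, so frames work as well.
w = tte_weights([1.0, 2.0, 3.0])
w_acc = weighted_accuracy([1, 0, 1], [0, 0, 1], w)
```

In this toy example the only error is on the sample closest to the event, which carries the smallest weight, so the weighted accuracy stays well above the unweighted value of 2/3.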

III-D2 Event risk assessment

Future locations of pedestrians have different implications for the ego-vehicle. Pedestrians who are directly in front of the vehicle pose more risk because they are more difficult to avoid. With this insight, we assign more weight to high risk samples ending in the center of the camera view, and gradually lower the weight towards the edges (Figure 3). The weights are given by,

	$\displaystyle\omega_{r}=\exp\left(-\frac{1}{2}\left(\frac{d_{cls}}{m\sigma}\right)^{2}\right),$
	$\displaystyle d_{cls}=\begin{cases}\|cls_{r}-\lceil m\rceil\|,&\text{if }N_{rc}\bmod 2=1\lor(N_{rc}\bmod 2=0\land cls_{r}\leq m)\\ \|cls_{r}-\lceil m\rceil-1\|,&\text{otherwise}\end{cases}$

where $\omega_{r}$ and $cls_{r}$ are weight and class index of the sample, respectively, $N_{rc}$ denotes the total number of risk regions (classes), and $m=\lceil\frac{N_{rc}}{2}\rceil$ . We set $\sigma=0.5$ empirically.
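A direct transcription of these two expressions is given below as a sketch. It assumes that $cls_{r}$ is a 1-based index over the $N_{rc}$ regions counted from left to right (so that $N_{rc}=12$ in the setup described earlier); the function names are illustrative.

```python
import math


def class_distance(cls_r, n_rc):
    """Distance of a 1-based region index from the central region(s)."""
    m = math.ceil(n_rc / 2)
    if n_rc % 2 == 1 or cls_r <= m:
        return abs(cls_r - m)
    return abs(cls_r - m - 1)


def risk_weight(cls_r, n_rc=12, sigma=0.5):
    """Unnormalized weight: 1 at the center, decaying toward the edges."""
    m = math.ceil(n_rc / 2)
    d = class_distance(cls_r, n_rc)
    return math.exp(-0.5 * (d / (m * sigma)) ** 2)


weights = [risk_weight(c) for c in range(1, 13)]
```

Under this reading, the two central regions receive the maximum weight and mirrored regions receive identical weights, which yields six distinct weight levels for twelve regions.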

III-E Per-instance metrics

To capture model consistency, we propose three new metrics. In order to compute them, we first rearrange the samples corresponding to each pedestrian instance in the order they have been originally extracted, i.e., resembling a moving window. Each pedestrian instance can contain between $1$ and $N$ samples, depending on the length of the track. We then compute metrics per-instance and average them over all instances (see Figure 4).

III-E1 Soft metrics

Metrics are averaged across samples corresponding to the unique pedestrian instance.

III-E2 Hard metrics

For each pedestrian instance, if the most confident class is the same for all samples, then that class is the prediction for that instance. Otherwise, we set the prediction for the instance as incorrect. For example, if the correct label of a pedestrian is crossing, but the model predicted at least one of the samples as non-crossing, we set the overall prediction for that pedestrian to non-crossing.
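The difference between the two aggregation schemes can be illustrated for accuracy as follows. This is a minimal sketch with hypothetical names; it treats an instance as correct under the hard scheme only if every one of its samples is classified correctly, which follows from the rule above. Other hard metrics (precision, F1) would additionally require instance-level predicted labels and are not covered here.

```python
from collections import defaultdict


def per_instance_accuracy(ped_ids, y_true, y_pred):
    """Return (soft_acc, hard_acc) averaged over pedestrian instances."""
    hits_by_ped = defaultdict(list)
    for pid, t, p in zip(ped_ids, y_true, y_pred):
        hits_by_ped[pid].append(t == p)
    n = len(hits_by_ped)
    # Soft: mean sample accuracy within each instance, then mean over instances.
    soft = sum(sum(h) / len(h) for h in hits_by_ped.values()) / n
    # Hard: an instance counts only if all of its samples are correct.
    hard = sum(all(h) for h in hits_by_ped.values()) / n
    return soft, hard


ped_ids = ["a", "a", "a", "b", "b"]
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0]
soft_acc, hard_acc = per_instance_accuracy(ped_ids, y_true, y_pred)
```

A single flipped sample lowers the soft score of pedestrian "a" only partially, but makes the whole instance incorrect under the hard scheme.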

III-E3 Confidence delta

We compute changes in the model’s confidence score for each class between two consecutive samples and report maximum and average delta given by,

	$\displaystyle\text{conf}_{\Delta}=\frac{\sum_{i\in\{1,...,n-1\}}|\text{conf}^{i}_{c}-\text{conf}^{i+1}_{c}|}{n-1},$

where $n$ is the total number of samples in that instance and $\text{conf}_{c}$ is the model’s confidence for class $c$ .
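For a single instance and class, the maximum and average delta can be computed as in the sketch below. The function name is illustrative, and returning zero for instances with a single sample (where the expression above is undefined) is an assumption.

```python
def confidence_delta(confs):
    """Return (max_delta, avg_delta) for one class over ordered samples."""
    if len(confs) < 2:
        # Assumption: an instance with one sample has no fluctuation.
        return 0.0, 0.0
    deltas = [abs(a - b) for a, b in zip(confs, confs[1:])]
    return max(deltas), sum(deltas) / len(deltas)


max_d, avg_d = confidence_delta([0.9, 0.7, 0.8])
```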

IV Evaluation: Intention and Action

IV-A Performance on benchmarks

TABLE II: Experiment results for intention estimation in PIE. $\uparrow$ and $\downarrow$ mean higher or lower values are better, respectively.

	PIE
	Base $\uparrow$						Soft/Hard $\uparrow$				Max/Avg $\downarrow$
Model	mAP	bAcc	AUC	Acc	Prec	F1	Acc	bAcc	Prec	F1	$\text{conf}_{\Delta}$
SFGRU	0.44	0.41	0.65	0.76	0.52	0.41	0.76/0.67	0.41/0.31	0.78/0.28	0.41/0.29	0.14/0.04
PCPA	0.46	0.42	0.65	0.75	0.41	0.41	0.77/0.59	0.45/0.27	0.45/0.28	0.44/0.27	0.21/0.06
BiPed	0.42	0.41	0.66	0.73	0.40	0.39	0.75/0.57	0.40/0.27	0.38/0.29	0.38/0.27	0.30/0.09
PedFormer	0.46	0.45	0.70	0.70	0.43	0.43	0.72/0.55	0.44/0.28	0.42/0.31	0.42/0.29	0.23/0.06

TABLE III: Experiment results for action prediction. $\uparrow$ and $\downarrow$ mean higher or lower values are better, respectively.

	PIE
	Base $\uparrow$			Base/Weighted $\uparrow$			Soft/Hard $\uparrow$				Max/Avg $\downarrow$
Model	mAP	bAcc	AUC	Acc	Prec	F1	bAcc	Acc	Prec	F1	$\text{conf}_{\Delta}$
SFGRU	0.75	0.75	0.87	0.83/0.82	0.79/0.73	0.65/0.64	0.76/0.61	0.85/0.72	0.87/0.50	0.67/0.43	0.10/0.04
PCPA	0.79	0.81	0.89	0.86/0.85	0.81/0.75	0.74/0.72	0.81/0.63	0.88/0.72	0.90/0.50	0.76/0.46	0.17/0.07
BiPed	0.84	0.84	0.93	0.89/0.88	0.84/0.80	0.79/0.77	0.86/0.66	0.90/0.74	0.90/0.56	0.82/0.51	0.16/0.06
PedFormer	0.88	0.85	0.94	0.89/0.88	0.85/0.81	0.79/0.78	0.86/0.76	0.91/0.80	0.89/0.66	0.82/0.65	0.12/0.04
	JAAD
SFGRU	0.61	0.76	0.85	0.86/0.87	0.60/0.63	0.61/0.61	0.74/0.60	0.86/0.74	0.62/0.34	0.59/0.31	0.17/0.07
PCPA	0.56	0.72	0.81	0.83/0.83	0.51/0.51	0.54/0.53	0.72/0.55	0.84/0.70	0.54/0.27	0.54/0.24	0.16/0.07
BiPed	0.60	0.75	0.86	0.85/0.85	0.57/0.58	0.58/0.59	0.75/0.52	0.87/0.68	0.68/0.23	0.61/0.20	0.21/0.10
PedFormer	0.63	0.78	0.87	0.86/0.85	0.58/0.58	0.62/0.61	0.78/0.58	0.87/0.72	0.64/0.32	0.64/0.28	0.15/0.07

Intention estimation results in Table II show that the performance of the models differs. While PedFormer stands out on most base metrics, on others it lags significantly behind SFGRU, by up to $6\%$ in accuracy and $9\%$ in precision. Hard and soft metrics highlight other differences. For instance, PCPA and SFGRU, which rely more on visual context, clearly dominate. Of particular interest is the soft precision of SFGRU, which is $33\%$ higher than that of the next best model, PCPA. This shows that SFGRU is fairly successful at distinguishing between pedestrian instances of different intention classes.

Hard metrics show significant performance drop on all models, suggesting that their overall consistency is low, i.e. predicted intentions fluctuate across successive samples within the pedestrian instances. But the amount of fluctuation varies and once again SFGRU is the best with the lowest max and avg $\text{conf}_{\Delta}$ , $7\%$ and $2\%$ better compared to PCPA.

Action prediction results, shown in Table III, tell a different story. On PIE, the more dynamics oriented model, PedFormer, is the best on almost all metrics, and hard metrics in particular, where it performs at least $6\%$ better on Acc and up to $14\%$ better on F1. This indicates that pedestrian trajectories and ego-motion are important for predicting upcoming actions. On the same dataset, SFGRU remains the most consistent model, having the best $\text{conf}_{\Delta}$ but with the lowest hard scores.

On JAAD, which only has categorical ego-motion information, the results are mixed. PedFormer is still better on most base metrics, whereas SFGRU does better on Acc and more so on Prec. While soft metrics favor PedFormer, SFGRU is more successful in hard metrics, showing better instance-wise consistency. On $\text{conf}_{\Delta}$ , all models perform very similarly, except for BiPed, which is more inconsistent.

The difference between base metrics and their weighted counterparts is marginal for most models, with some exceptions on JAAD, where weighted metrics are better. This can be due to noise or general inconsistencies in JAAD annotations that in some cases favor samples extracted farther from the event. On PIE, weighted precision of some models is lower than base by up to $6\%$ . This is expected, as the uncertainty of prediction is higher farther away from the crossing event.

IV-B Impact of context modeling on performance

Given the high variability of traffic scenes and black-box nature of deep learning models, it is generally difficult to pinpoint what contextual elements contributed to the correct predictions. However, considering the architectural differences between the models and their performance on different tasks and datasets, we can observe some patterns.

Referring back to the task definitions in Section II-A, pedestrian intention reflects their motivation or goal, which is not affected by the environmental factors. For instance, if a pedestrian wants to cross the road to go to a store, they will try to do so either at the signalized crossing or by finding a safe gap in traffic if the nearest controlled intersection is too far. However, the pedestrian’s ultimate intention (or goal) to cross the road remains constant, unless their objective of going to the store on the other side of the street changes.

In terms of modeling context, different algorithms rely on different sources of information. While SFGRU and PCPA use images of pedestrians (cropped to capture surrounding context) and their poses, BiPed and PedFormer mainly rely on dynamics (pedestrians’ and the ego-vehicle’s) and only use visual context represented by semantic maps of the scenes for modeling interactions between the agents.

On the intention task, performance of all models is generally low due to the inherent difficulty of the task and the uncertainty and noise present in human judgment annotations. However, models that rely more on visual features tend to be more successful at distinguishing different intention classes. Thus, in addition to dynamic cues, it is likely that detailed visual context, e.g. head orientation, posture, etc., is necessary for accurate estimation.

In comparison, results on action prediction show that effective modeling of scene dynamics is more important for prediction accuracy. This is apparent in the ranking of the models in Tables II and III on the PIE dataset. On JAAD, which lacks accurate dynamics information, we can see a significant degradation of the performance on action prediction on all metrics.

IV-C Scenario-based analysis

TABLE IV: The mAP for intention estimation and action prediction tasks on PIE for different scenarios. The colors are computed over all cells for each task. Green and red indicate the best and worst performance, respectively.

To further highlight the differences between intention estimation and action prediction tasks, we take a closer look at the data properties. We use the PIE dataset and, based on its annotations, split scenarios into three categories: pedestrian, ego-vehicle, and environment. For pedestrians, we consider two factors: scale, equal to the height of bounding boxes in pixels, and state, indicating whether the pedestrian is walking or standing during the observation period. In the case of the ego-vehicle, we consider speed in km/h. For environment, we use signal and road. Signal indicates the traffic light state w.r.t. the ego-vehicle: forbid (red), allow (yellow or green), or none (no traffic light is present). Road specifies the direction of traffic: one-way or two-way.

For all factors, we average the characteristics over the observation period. We then select the best performing models for each task and calculate their mAP. The results in Table IV show distinct impacts of different contextual factors on intention estimation and action prediction, as reflected in the color distribution for each task.

Pedestrian factors. Pedestrian state is the most notable factor that plays an important role for both tasks. In particular, walking towards the road is a strong indicator of crossing intention as well as likely occurrence of crossing event (consistent with the finding that walking pedestrians tend to accept shorter gaps [49]). The intention and upcoming action of standing pedestrians are more difficult to estimate, thus the significant drop in the model performance.

There are some differences observable on pedestrian scale factor. For instance, PedFormer achieves the best performance on the largest scale on intention and worst on action. As there is no clear pattern of change across different scales, such performance difference can be due to the presence of other factors or perhaps the differences in the distribution of samples between intention and action tasks.

Ego-vehicle factors. On the action prediction task, there is a very clear pattern of performance degradation across different ego-speed thresholds: from a high of $95\%$ on scenarios where the ego-vehicle is stationary to a low of $8\%$ when it is moving fast. This can be attributed to the impact of ego-motion on how pedestrian movements appear in the image plane, as well as increased uncertainty in pedestrian decision-making as they need to negotiate with the ego-vehicle in order to cross. On the intention task, however, the changes due to ego-speed are much smaller, only a few percent. This supports our earlier claim that behaviors of other road users should have a minor, if any, impact on the intentions of pedestrians.

Environment factors. Traffic light forbid (red) state has a significant impact on the action prediction accuracy: when the ego-vehicle is not moving, its influence on pedestrians’ behavior is minimized. The action prediction models also perform better on two-way streets. This can potentially be due to the properties of the dataset, in which the percentage of challenging scenarios, e.g., jaywalking, is smaller compared to narrower one-way streets. In comparison, on the intention task, there is no significant performance gap across different environment factors, pointing to the fact that intention is less influenced by them.

It should be noted that some fluctuations within scenarios and across tasks can be due to the uneven distribution of samples and the fact that not all samples have overlaps (as shown in Figure 2). In addition, a single-factor analysis may not reveal all dependencies because factors may interact. For example, in ego-speed scenarios different data partitions may include pedestrians of different scales or with different states. However, a multi-factor analysis (similar to [43]) was not feasible due to the sparsity of the data in each subclass of the tasks for training and evaluation.

IV-D Model agreement between intention and action

In this section, we test whether models’ predictions for both tasks are in agreement, e.g., if the model predicts that the pedestrian has no intention of crossing, will it also predict that they will not cross the road? To test this, we train PedFormer on both intention and action tasks simultaneously. For consistency with human judgment annotations, we use the samples from the intention task (see Figure 2).

As shown in Table V, we split the results based on the combination of intention and action classes (rows) and PedFormer’s performance (correct or incorrect) on each task (columns). Overall, the model infers both tasks correctly on $63\%$ of the samples, both incorrectly on $3.5\%$ , and only one of the tasks correctly on the remainder.

In $11\%$ of the samples, intention but not action is predicted correctly, meaning that some cues that help estimate intention do not necessarily reflect whether the action will take place. Therefore, in such cases, intention can play a complementary role.
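Such a four-way breakdown can be tallied from per-sample correctness flags on the two tasks, as in the sketch below. The function name and the dictionary keys are illustrative, and the inputs are assumed to be lists of Python booleans, one entry per sample, aligned across the two tasks.

```python
from collections import Counter


def agreement_breakdown(intent_correct, action_correct):
    """Fraction of samples in each (intention, action) correctness cell."""
    counts = Counter(zip(intent_correct, action_correct))
    n = len(intent_correct)
    return {
        "both": counts[(True, True)] / n,
        "intention_only": counts[(True, False)] / n,
        "action_only": counts[(False, True)] / n,
        "neither": counts[(False, False)] / n,
    }


cells = agreement_breakdown(
    [True, True, False, False, True],
    [True, False, True, False, True],
)
```

Because the four cells partition the samples, their fractions sum to one, which provides a simple consistency check on the reported percentages.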

TABLE VI: The results for the event risk assessment task. $\uparrow$ and $\downarrow$ mean higher or lower values are better, respectively.

	PIE
	Base $\uparrow$			Base/Weighted $\uparrow$			Soft/Hard $\uparrow$				Max/Avg $\downarrow$
Model	mAP	bAcc	AUC	Acc	Prec	F1	bAcc	Acc	Prec	F1	$\text{conf}_{\Delta}$
SFGRU	0.26	0.22	0.86	0.72/0.50	0.23/0.21	0.21/0.19	0.23/0.16	0.75/0.64	0.25/0.16	0.23/0.14	0.18/0.02
PCPA	0.24	0.18	0.86	0.70/0.47	0.17/0.16	0.16/0.13	0.17/0.15	0.72/0.62	0.15/0.15	0.16/0.14	0.20/0.02
BiPed	0.26	0.23	0.87	0.73/0.51	0.28/0.25	0.24/0.21	0.21/0.15	0.75/0.65	0.22/0.20	0.20/0.14	0.23/0.02
PedFormer	0.45	0.42	0.95	0.80/0.65	0.49/0.49	0.44/0.44	0.39/0.25	0.82/0.69	0.45/0.44	0.40/0.29	0.17/0.01
	JAAD
SFGRU	0.23	0.24	0.75	0.42/0.27	0.24/0.21	0.22/0.19	0.25/0.15	0.47/0.33	0.25/0.30	0.23/0.14	0.20/0.02
PCPA	0.16	0.16	0.67	0.37/0.18	0.15/0.12	0.11/0.07	0.16/0.14	0.41/0.34	0.09/0.08	0.11/0.09	0.15/0.01
BiPed	0.21	0.22	0.73	0.39/0.24	0.21/0.18	0.20/0.17	0.21/0.12	0.43/0.29	0.19/0.20	0.18/0.10	0.29/0.04
PedFormer	0.42	0.40	0.90	0.53/0.42	0.43/0.41	0.41/0.39	0.43/0.22	0.60/0.36	0.48/0.43	0.44/0.25	0.22/0.02

In the cases where only action is correct ( $22.6\%$ ), the samples are distributed more evenly across all three classes of intention, but for the majority no crossing action takes place. This can primarily be attributed to the lack of pedestrian dynamics cues, which is the case for about $80\%$ of the samples. Inferring crossing intention of standing pedestrians is more challenging and requires analysis of other contextual cues, such as proximity to the road, transit station nearby, pose, etc. In fact, $78\%$ of all unsure intention samples in the test set contain pedestrians that are mainly standing for the duration of observation. Taking all the results into account, PedFormer rarely assigns unsure labels. Note that some variations across different categories can be due to an insufficient number of samples, e.g., those with non-crossing intention and crossing action.

V Event Risk Assessment

As discussed in Section II-A, the event risk assessment task determines whether a given pedestrian would pose a risk to the ego-vehicle based on how close the predicted location of the pedestrian is w.r.t. the center of the vehicle. The results are summarized in Table VI. Given the nature of the task and its dependency on accurate dynamics estimation, PedFormer achieves significantly better performance on all metrics, except $\text{conf}_{\Delta}$ on JAAD, on which PCPA stands out.

Plots of per-class distribution of PedFormer and SFGRU models (see Figure 5) provide a better insight into challenging areas. As anticipated, performance of the models is the best closer to the edges of the frame, partly due to more data and also the fact that pedestrians that appear there often do not cross and remain stationary at the time of prediction. In such cases, observed movements of the pedestrians in the image plane are only due to the ego-motion of the vehicle, hence the uncertainty of risk prediction is lower.

The performance on other risk classes is mixed, which can be attributed to the variability of sample properties in each class, since the number of samples in each class is about the same. Of note is the increasing gap between the two models towards the center of the frame, or areas where pedestrians are crossing. Therefore, better estimation of motion is crucial for accurate prediction, hence PedFormer is more successful.

On PIE, the trends in the model performance diverge. For instance, from the 4th to the 5th and from the 6th to the 7th columns (classes, counted from left to right), the performance of PedFormer declines whereas SFGRU’s improves. At the same time, PedFormer’s performance drops from the 5th to the 6th class by more than $20\%$ , whereas for SFGRU the difference is less than $10\%$ . This indicates that besides motion information, in some scenarios, visual context is also important for accurate prediction.

On the JAAD dataset, the performance of both models follows a similar trend, although PedFormer is better on all classes. This is because JAAD does not provide accurate ego-motion. As a result, both models rely mainly on the changes of pedestrian bounding boxes for reasoning about dynamics.

VI Discussion and Conclusions

VI-A Role of different tasks

We discussed three tasks, intention estimation, action prediction, and event risk assessment, and argued that each plays a unique role in understanding and forecasting pedestrian behavior. Intention estimation reflects what an observed pedestrian wants to do. Knowing the intention, one can expect certain types of actions to follow or determine relevance of the agent, e.g., for causal representation learning [50]. Action is the realization of the intention (motive) of the pedestrian and can be seen as an early cue for possible types of motions to expect. For instance, predicted crossing action implies the possibility of lateral motion in front of the vehicle. Lastly, event risk assessment helps estimate the potential danger of pedestrian action.

Besides the importance of each individual task, our agreement study revealed that for a large subset of the samples, the model does not correctly predict both tasks at the same time, partially due to the absence of necessary cues for accurate prediction of either task. This finding suggests that in such scenarios, different tasks can play a complementary role for understanding pedestrian behavior.
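An agreement breakdown of this kind can be tabulated with a few lines of code. The sketch below is a minimal illustration that assumes per-sample ground-truth and predicted labels are available as parallel sequences; the function name and the category keys are hypothetical and are not taken from the released benchmark code.

```python
from collections import Counter


def agreement_breakdown(int_true, int_pred, act_true, act_pred):
    """Return the fraction of samples in each agreement category.

    Categories (hypothetical names): both tasks correct, only intention
    correct, only action correct, neither correct.
    """
    keys = ("both", "intention_only", "action_only", "neither")
    counts = Counter()
    for it, ip, at, ap in zip(int_true, int_pred, act_true, act_pred):
        int_ok = it == ip
        act_ok = at == ap
        if int_ok and act_ok:
            counts["both"] += 1
        elif int_ok:
            counts["intention_only"] += 1
        elif act_ok:
            counts["action_only"] += 1
        else:
            counts["neither"] += 1
    total = sum(counts.values())
    if total == 0:
        # No samples: report zero for every category.
        return {k: 0.0 for k in keys}
    return {k: counts[k] / total for k in keys}
```

Samples that fall into the "intention_only" or "action_only" categories are the ones for which the two tasks provide complementary information.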

VI-B Model performance

Our data analysis and empirical evaluations of existing pedestrian prediction models showed that 1) performance of the models on different tasks is not similar, 2) the ranking of the models varies across different tasks and metrics, and 3) performance on each task is not necessarily impacted by the same factors. As a result, models trained on different tasks (e.g., intention and action) are not directly comparable.

The new per-instance metrics revealed the lack of temporal consistency in model predictions, even within the short span of $2$ s. In intelligent driving systems, such performance fluctuations can lead to irrational behavior by the vehicle or its driver and, consequently, pose risks to other road users. These consistency issues should be remedied in future work, perhaps by enforcing instance-wise temporal continuity during training and minimizing the effect of spurious correlations with contextual elements.
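One simple way to quantify such fluctuations is to count how often the predicted label changes between consecutive predictions for the same pedestrian. The sketch below is an illustrative measure only, not the per-instance metric defined in this paper; the function name `flip_rate` is hypothetical, and the input is assumed to be the time-ordered list of predicted labels for one pedestrian instance.

```python
def flip_rate(preds):
    """Fraction of consecutive prediction pairs whose label differs.

    Returns 0.0 for perfectly stable predictions and 1.0 when the label
    changes at every step. Sequences shorter than two have no pairs and
    are treated as stable.
    """
    preds = list(preds)
    if len(preds) < 2:
        return 0.0
    flips = sum(a != b for a, b in zip(preds, preds[1:]))
    return flips / (len(preds) - 1)
```

For example, the sequence `[1, 1, 0, 0, 1]` contains two label changes over four consecutive pairs, giving a flip rate of 0.5. A quantity of this kind could, in principle, also serve as a training-time penalty for enforcing temporal continuity.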

In this work, we primarily focused on the input modality of the prediction models and showed that the models that rely more on visual context tend to be more successful on intention estimation and dynamics-oriented models perform better on action and risk tasks. In the future, this analysis can be extended to evaluating the contributions of different modules and interaction between the tasks in a single framework, such as multitasking [29, 37] or chain reasoning [1].

VI-C Factors that matter for each task

Our factor analysis highlighted that various contextual elements affect model performance differently on each task. While action is predominantly influenced by dynamics and environmental factors, performance on intention estimation is mainly influenced by pedestrian state. This outcome suggests that intention estimation is inherently a more challenging task as it is not directly influenced by the surrounding environment of the pedestrians. Hence, intention estimation models should also capture more subtle behavioral cues, such as the proximity of the pedestrian to the road, their orientation with respect to the road and the ego-vehicle (e.g., looking at the traffic), their closeness to other objects (e.g., a transit station) that may reveal their intention (e.g., taking a ride), their other activities (e.g., talking on the phone or to another person), and many more elements that require context analysis and spatial reasoning.

Lastly, we examined the variability of the model performance on a single task w.r.t. different data properties. Although some major trends were observed, according to behavioral literature [49], there are many more factors that potentially impact the motives and behaviors of pedestrians. As a result, a more fine-grained analysis based on multiple contextual factors is needed, but was not possible here due to data limitations. Future data collection efforts could mitigate this issue by ensuring sufficient data scale and diversity.

References

[1] A. Rasouli, I. Kotseruba, T. Kunic, and J. K. Tsotsos, “PIE: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction,” in ICCV, 2019.
[2] T. Chen, T. Jing, R. Tian, Y. Chen, J. Domeyer, H. Toyoda, R. Sherony, and Z. Ding, “PSI: A pedestrian behavior dataset for socially intelligent autonomous car,” arXiv:2112.02604, 2021.
[3] W. James, The Principles Of Psychology. Henry Holt and Co., 1890.
[4] J. R. Searle, Intentionality: An Essay in the Philosophy of Mind. Cambridge University Press, 1983.
[5] M. Bratman, Intention, Plans, and Practical Reason. Harvard University Press, 1987.
[6] M. Georgeff, B. Pell, M. Pollack, M. Tambe, and M. Wooldridge, “The belief-desire-intention model of agency,” in ATAL, 1999.
[7] F. Schneemann and P. Heinemann, “Context-based detection of pedestrian crossing intention for autonomous driving in urban environments,” in IROS, 2016.
[8] K. Saleh, M. Hossny, and S. Nahavandi, “Real-time intent prediction of pedestrians for autonomous ground vehicles via spatio-temporal DenseNet,” in ICRA, 2019.
[9] P. Gujjar and R. Vaughan, “Classifying pedestrian actions in advance using predicted video of urban driving scenes,” in ICRA, 2019.
[10] B. Liu, E. Adeli, Z. Cao, K.-H. Lee, A. Shenoi, A. Gaidon, and J. C. Niebles, “Spatiotemporal relationship reasoning for pedestrian intent prediction,” RAL, vol. 5, no. 2, pp. 3485–3492, 2020.
[11] Z. Sui, Y. Zhou, X. Zhao, A. Chen, and Y. Ni, “Joint intention and trajectory prediction based on transformer,” in IROS, 2021.
[12] B. Yang, W. Zhan, P. Wang, C. Chan, Y. Cai, and N. Wang, “Crossing or not? context-based recognition of pedestrian crossing intention in the urban environment,” Trans. ITS, vol. 23, no. 6, pp. 5338–5349, 2021.
[13] H. Wu, S. Zheng, Q. Xu, and J. Wang, “Applying the extended theory of planned behavior to pedestrian intention estimation,” in IV, 2021.
[14] T. Chen, R. Tian, and Z. Ding, “Visual reasoning using graph convolutional networks for predicting pedestrian crossing intention,” in ICCVW, 2021.
[15] H. Razali, T. Mordan, and A. Alahi, “Pedestrian intention prediction: A convolutional bottom-up multi-task approach,” Transportation Research Part C: Emerging Technologies, vol. 130, p. 103259, 2021.
[16] S. Zhang, M. Abdel-Aty, Y. Wu, and O. Zheng, “Pedestrian crossing intention prediction at red-light using pose estimation,” Trans. ITS, vol. 23, no. 3, pp. 2331–2339, 2021.
[17] A. Y. Naik, A. Bighashdel, P. Jancura, and G. Dubbelman, “Scene spatio-temporal graph convolutional network for pedestrian intention estimation,” in IV, 2022.
[18] X. Song, M. Kang, S. Zhou, J. Wang, Y. Mao, and N. Zheng, “Pedestrian intention prediction based on traffic-aware scene graph model,” in IROS, 2022.
[19] D. Yang, H. Zhang, E. Yurtsever, K. A. Redmill, and Ü. Özgüner, “Predicting pedestrian crossing intention with feature fusion and spatio-temporal attention,” Trans. IV, vol. 7, no. 2, pp. 221–230, 2022.
[20] J. Huang, A. Gautam, and S. Saripalli, “Learning pedestrian actions to ensure safe autonomous driving,” in IV, 2023.
[21] Y. Zhou, G. Tan, R. Zhong, Y. Li, and C. Gou, “PIT: Progressive interaction transformer for pedestrian crossing intention prediction,” Trans. ITS, 2023.
[22] M. Upreti, J. Ramesh, C. Kumar, B. Chakraborty, V. Balisavira, M. Roth, V. Kaiser, and P. Czech, “Traffic light and uncertainty aware pedestrian crossing intention prediction for automated vehicles,” in IV, 2023.
[23] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “Are they going to cross? a benchmark dataset and baseline for pedestrian crosswalk behavior,” in ICCVW, 2017.
[24] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “Pedestrian action anticipation using contextual feature fusion in stacked RNNs,” in BMVC, 2019.
[25] P. R. G. Cadena, M. Yang, Y. Qian, and C. Wang, “Pedestrian graph: Pedestrian crossing prediction based on 2d pose estimation and graph convolutional networks,” in ITSC, 2019.
[26] M. Chaabane, A. Trabelsi, N. Blanchard, and R. Beveridge, “Looking ahead: Anticipating pedestrians crossing with future frames prediction,” in WACV, 2020.
[27] T. Yau, S. Malekmohammadi, A. Rasouli, P. Lakner, M. Rohani, and J. Luo, “Graph-SIM: A graph-based spatiotemporal interaction modelling for pedestrian action prediction,” in ICRA, 2021.
[28] Y. Yao, E. Atkins, M. J. Roberson, R. Vasudevan, and X. Du, “Coupling intent and action for pedestrian crossing behavior prediction,” arXiv:2105.04133, 2021.
[29] A. Rasouli, M. Rohani, and J. Luo, “Bifold and semantic reasoning for pedestrian behavior prediction,” in ICCV, 2021.
[30] A. Singh and U. Suddamalla, “Multi-input fusion for practical pedestrian intention prediction,” in ICCVW, 2021.
[31] J. Gesnouin, S. Pechberti, B. Stanciulescu, and F. Moutarde, “Assessing cross-dataset generalization of pedestrian crossing predictors,” in IV, 2022.
[32] L. Achaji, J. Moreau, T. Fouqueray, F. Aioun, and F. Charpillet, “Is attention to bounding boxes all you need for pedestrian action prediction?” in IV, 2022.
[33] A. Rasouli, T. Yau, M. Rohani, and J. Luo, “Multi-modal hybrid architecture for pedestrian action prediction,” in IV, 2022.
[34] P. R. G. Cadena, Y. Qian, C. Wang, and M. Yang, “Pedestrian graph +: A fast pedestrian crossing prediction model based on graph convolutional networks,” Trans. ITS, vol. 23, no. 11, pp. 21 050–21 061, 2022.
[35] X. Zhai, Z. Hu, D. Yang, L. Zhou, and J. Liu, “Social aware multi-modal pedestrian crossing behavior prediction,” in ACCV, 2022.
[36] C. Zhang, A. H. Kalantari, Y. Yang, Z. Ni, G. Markkula, N. Merat, and C. Berger, “Cross or wait? predicting pedestrian interaction outcomes at unsignalized crossings,” in IV, 2023.
[37] A. Rasouli and I. Kotseruba, “PedFormer: Pedestrian behavior prediction via cross-modal attention modulation and gated multitask learning,” in ICRA, 2023.
[38] R. Karim, S. M. A. Shabestary, and A. Rasouli, “Destine: Dynamic goal queries with temporal transductive alignment for trajectory prediction,” in ICRA, 2024.
[39] M. Wang, X. Zhu, C. Yu, W. Li, Y. Ma, R. Jin, X. Ren, D. Ren, M. Wang, and W. Yang, “Ganet: Goal area network for motion forecasting,” in ICRA, 2023.
[40] D. Xiao, M. Dianati, W. G. Geiger, and R. Woodman, “Review of graph-based hazardous event detection methods for autonomous driving systems,” Trans. ITS, 2023.
[41] D. Wang, W. Fu, Q. Song, and J. Zhou, “Potential risk assessment for safe driving of autonomous vehicles under occluded vision,” Scientific Reports, vol. 12, no. 1, p. 4981, 2022.
[42] M. Herman, J. Wagner, V. Prabhakaran, N. Möser, H. Ziesche, W. Ahmed, L. Bürkle, E. Kloppenburg, and C. Gläser, “Pedestrian behavior prediction for automated driving: Requirements, metrics, and relevant features,” Trans. ITS, vol. 23, no. 9, pp. 14 922–14 937, 2021.
[43] A. Rasouli, “A novel benchmarking paradigm and a scale-and motion-aware model for egocentric pedestrian trajectory prediction,” arXiv:2310.10424, 2023.
[44] S. Malla, B. Dariush, and C. Choi, “TITAN: Future forecast using action priors,” in CVPR, 2020.
[45] A. Rasouli, T. Yau, P. Lakner, S. Malekmohammadi, M. Rohani, and J. Luo, “PePScenes: A novel dataset and baseline for pedestrian action prediction in 3D,” in NeurIPSW, 2020.
[46] D. Guo, T. Mordan, and A. Alahi, “Pedestrian Stop and Go Forecasting with Hybrid Feature Fusion,” in ICRA, 2022.
[47] I. Kotseruba, A. Rasouli, and J. K. Tsotsos, “Benchmark for evaluating pedestrian action prediction,” in WACV, 2021.
[48] M. Green, “‘How long does it take to stop?’ Methodological analysis of driver perception-brake times,” Transportation Human Factors, vol. 2, no. 3, pp. 195–216, 2000.
[49] A. Rasouli and J. K. Tsotsos, “Autonomous vehicles that interact with pedestrians: A survey of theory and practice,” Trans. ITS, vol. 21, no. 3, pp. 900–918, 2019.
[50] R. Roelofs, L. Sun, B. Caine, K. S. Refaat, B. Sapp, S. Ettinger, and W. Chai, “CausalAgents: A robustness benchmark for motion forecasting using causal relationships,” arXiv:2207.03586, 2022.