FootBots: A Transformer-based Architecture
for Motion Prediction in Soccer

Abstract

Motion prediction in soccer involves capturing complex dynamics from player and ball interactions. We present FootBots, an encoder-decoder transformer-based architecture addressing motion prediction and conditioned motion prediction through equivariance properties. FootBots captures temporal and social dynamics using set attention blocks and a multi-attention block decoder. Our evaluation utilizes two datasets: a real soccer dataset and a tailored synthetic one. Insights from the synthetic dataset highlight the effectiveness of FootBots’ social attention mechanism and the significance of conditioned motion prediction. Empirical results on real soccer data demonstrate that FootBots outperforms baselines in motion prediction and excels in conditioned tasks, such as predicting the players based on the ball position, predicting the offensive (defensive) team based on the ball and the defensive (offensive) team, and predicting the ball position based on all players. Our evaluation connects quantitative and qualitative findings. https://youtu.be/9kaEkfzG3L8

Index Terms—  Motion prediction, Signal forecasting, Transformer, Trajectory understanding, Soccer.

1 Introduction

Multi-agent Motion Prediction (MP) holds critical importance across various domains, encompassing financial economics [1], human pose estimation [2, 3, 4, 5], pedestrian behavior analysis [6, 7, 8, 9, 10, 11], and sports analytics [12, 13, 14, 15, 16, 17]. This task involves forecasting future positions and motions of multiple agents in a shared environment. In multi-agent sports like soccer, accurate motion prediction deepens insights into player behavior, team dynamics, and on-field decision-making processes.

The intricacies of soccer, characterized by swift changes in player positions, rapid ball movements, and intricate team coordination, underscore the need for advanced models surpassing conventional non-social-aware motion prediction. In image processing and computer vision, these challenges crucially apply to enhancing player tracking and re-identification, where precise position forecasts contribute to performance analysis, team strategies, and overall gameplay understanding. These forecasts can also be used to simulate team strategies and obtain advanced metrics [18]. The dynamics of soccer, coupled with player interactions, drive the Conditioned Motion Prediction (CMP) task, addressing scenarios like predicting player trajectories based on ball position [13, 17]. Figure 1 illustrates the MP task and the four CMP tasks that we consider in this paper.

Given the dynamic nature of soccer, a robust model must exhibit permutation equivariance [13, 14], adapting to varying player compositions and interactions. Transformer-based architectures [19], known for their handling of varying-length sequences and permutation equivariance, have gained traction in motion prediction tasks [10, 11], including sports [17].

This research introduces a comprehensive approach to soccer MP and CMP, employing a transformer-based model. Our model captures soccer’s intricate properties, leveraging permutation equivariance and historical data to predict ball and player 2D trajectories. Evaluation against baselines showcases social awareness in soccer. The study introduces a synthetic dataset tailored for this research, and utilizes a real one from LaLiga 2022-2023.

Fig. 1: Motion prediction in soccer. The method predicts both player and ball motions from partial 2D trajectories under specified conditions. In the figure, squares represent the end positions of ground truth offensive and defensive team players, crosses denote their predicted positions, and circles indicate the final ball positions. Five different tasks (MP, CMP1-4) for the same test sequence are displayed, each tailored to predict a specific subset of agents, as specified in parentheses.

2 Related work

The evolution of multi-agent motion prediction originates from human pose techniques, initially utilizing Recurrent Neural Networks (RNN) [2, 3] to capture temporal dynamics. This evolution extended to pedestrian motion, fusing RNN with social pooling [6, 7] to capture social interactions. However, [20] introduced an RNN baseline devoid of social encoding that nonetheless surpassed social pooling methods.

Recognizing the significance of social interactions led to the adoption of Graph Neural Networks (GNN) in pedestrian modeling [8, 9], coupled with recurrent techniques. Additionally, siMLPe [5] demonstrates the effectiveness of a Multi-Layer Perceptron (MLP) architecture in capturing temporal and spatial dynamics for human motion prediction. Transformer-based architectures also made substantial contributions, showcasing their ability to encode temporal and social dynamics in human pose estimation [4] and in forecasting pedestrian and vehicle trajectories [10, 11, 21].

In sports, initial methods focused on generating long-term basketball trajectories using Variational Autoencoders (VAE) [22] and Variational Recurrent Neural Networks (VRNN) [23, 24]. These variational methods were later outperformed by a multi-modal RNN-based architecture [15]. To leverage permutation equivariance without the necessity of agent ordering, subsequent research integrated VRNN with GNN for generating multi-agent trajectories in basketball and soccer [13, 25, 16]. Addressing concerns associated with accumulated errors in recurrent methods, [14] combined Graph Attention Networks (GAT) with temporal convolutional networks in soccer. Transformer-based models have also found applications in the sports domain, demonstrating superior performance compared to graph-recurrent-based approaches [17, 26] in NBA trajectories. Nevertheless, simultaneously conducting attention in both temporal and social dimensions still incurs a notable computational cost.

Our method introduces a tailored transformer encoder-decoder for soccer, adeptly adapting to the sport’s intricacies involving a higher number of agents compared to basketball. To enhance computational efficiency, the model is optimized by sequentially decoupling temporal and social attentions. We leverage permutation equivariance alongside the agents’ ordering. Moreover, we showcase the effectiveness of our approach in addressing soccer’s CMP task using a tailored synthetic dataset and a real one, skillfully capturing intricate agent interactions.

3 Our Method

3.1 Problem statement

Consider a set of $M\in\mathbb{N}$ agent measures ($M-1$ players and a ball in our context), $X=\{\mathbf{x}^{1},\ldots,\mathbf{x}^{M}\}$, where every measure contains $k$ elements. The measures evolve over a time horizon of $t+T\in\mathbb{N}$ where $t$ and $T$ are positive integers. Particularly, $t$ represents the observations, while $T$ covers predictions spanning approximately 4 seconds [12, 13, 14]. On the one hand, we can define both prior $\mathcal{X}_{0:t}$ and posterior $\mathcal{X}_{t+1:t+T}$ sequences where specific $\mathbf{x}^{m}_{t}$ are collected to define our MP problem as:

$$f(\mathcal{X}_{0:t})=\mathcal{X}_{t+1:t+T}, \tag{1}$$

where $f$ represents a function to infer the posterior data from the prior.

On the other hand, we also consider a CMP problem. In particular, CMP predicts the positions of some specific agents ($P$), given the positions of other ones ($C$) in a scenario. Let $\mathcal{X}^{C}_{0:t+T}$ represent the complete sequence of observations for the $C$ agents, and let $\mathcal{X}^{P}_{0:t}$ and $\mathcal{X}^{P}_{t+1:t+T}$ denote the prior and posterior states of the agents to be predicted, respectively. Then we can define the model $f_{c}$ to solve the CMP problem as:

$$f_{c}(\mathcal{X}^{C}_{0:t+T},\mathcal{X}^{P}_{0:t})=\mathcal{X}^{P}_{t+1:t+T}. \tag{2}$$

The challenge in MP is to forecast trajectories of all $M$ agents (players and ball). In contrast, CMP encompasses various tasks within soccer that we introduce next:

CMP1: Predicting players’ positions with the ball as a conditioning agent.

CMP2: Predicting defensive team players’ positions using all other agents as conditioning ones.

CMP3: Predicting offensive team players’ positions using all other agents as conditioning ones.

CMP4: Predicting the ball position with players as conditioning agents.
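To illustrate how the five tasks partition the agents into conditioning ($C$) and predicted ($P$) subsets, the following minimal Python sketch derives both index sets from per-agent role labels. The integer role codes and the helper name `split_agents` are assumptions made for this example; they are not taken from the FootBots implementation.

```python
import numpy as np

# Hypothetical integer role codes (assumed for this sketch).
BALL, DEFENSE, OFFENSE = 0, 1, 2

def split_agents(roles, task):
    """Return (conditioning_idx, predicted_idx) for MP or CMP1-4."""
    roles = np.asarray(roles)
    idx = np.arange(len(roles))
    predicted = {
        "MP": np.ones(len(roles), dtype=bool),  # every agent is predicted
        "CMP1": roles != BALL,                  # players, given the ball
        "CMP2": roles == DEFENSE,               # defenders, given the rest
        "CMP3": roles == OFFENSE,               # attackers, given the rest
        "CMP4": roles == BALL,                  # ball, given the players
    }[task]
    return idx[~predicted], idx[predicted]

# One ball plus two teams of ten field players.
roles = [BALL] + [DEFENSE] * 10 + [OFFENSE] * 10
for task in ("MP", "CMP1", "CMP2", "CMP3", "CMP4"):
    cond, pred = split_agents(roles, task)
    print(task, "conditioning:", len(cond), "predicted:", len(pred))
```

In MP the conditioning set is empty, which matches Eq. (1) being the special case of Eq. (2) with no conditioning agents.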

Fig. 2: FootBots architecture in soccer. FootBots exploits an encoder-decoder structure with sequential temporal and social attention mechanisms. It incorporates Set Attention Blocks to encode temporal ($\text{SAB}_T$) and social ($\text{SAB}_S$) dynamics represented in the context $\mathcal{C}$. The Multi-Attention Block Decoder in the temporal axis ($\text{MABD}_T$) and $\text{SAB}_S$ in the decoder generate the predicted trajectories. FootBots is capable of solving both MP and CMP tasks in soccer, with the decoder input $\mathcal{H}$ varying depending on the task.

3.2 Attention mechanisms

Attention mechanisms are effective at capturing relationships in sequences or sets. Given $n$ queries $\mathbf{Q}$ of dimension $d_{k}$, $n_{v}$ keys $\mathbf{K}$ of dimension $d_{k}$, and $n_{v}$ values $\mathbf{V}$ of dimension $d_{v}$, the attention mechanism computes weighted value sums using compatibility between queries and keys as:

$$\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{k}}}\right)\mathbf{V}, \tag{3}$$

with $\mathbf{Q}\in\mathbb{R}^{n\times d_{k}}$, $\mathbf{K}\in\mathbb{R}^{n_{v}\times d_{k}}$, $\mathbf{V}\in\mathbb{R}^{n_{v}\times d_{v}}$. In practice, the attention mechanism is often extended with multiple attention heads, also called Multi-Head Attention (MHA), originally introduced in the Transformer architecture [19], allowing the model to capture different relations in the data.
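As a concrete reference for Eq. (3), the following NumPy sketch implements single-head scaled dot-product attention. It is a minimal illustration only: the learned projections and the multi-head split of MHA are omitted, and the sizes used in the example are arbitrary.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (3).
    Q: [n, d_k], K: [n_v, d_k], V: [n_v, d_v] -> output [n, d_v]."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # [n, n_v]
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, n_v, d_k, d_v = 4, 6, 8, 5                        # arbitrary sizes
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n_v, d_k))
V = rng.normal(size=(n_v, d_v))
out, weights = attention(Q, K, V)
print(out.shape, weights.shape)
```

Each output row is a convex combination of the rows of $\mathbf{V}$, since the softmax weights are non-negative and sum to one.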

The MHA operation was extended to work on sets by defining a Set Attention Block (SAB) [27], an adaptation of the encoder block of the Transformer that lacks the positional encoding. The MHA itself provides the property of permutation equivariance, making the SAB a permutation-equivariant operation. Finally, the original Transformer [19] builds the output using its decoder, also called Multi-Attention Block Decoder (MABD) [11], which utilizes cross-attention to take into account the encoder output.
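The permutation-equivariance property can be checked numerically. The sketch below, which assumes a single attention head with random projection matrices standing in for learned weights, verifies that permuting the input set permutes the output in the same way, and that adding a position-dependent term breaks the property.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over the rows (set elements) of X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
M, d = 5, 8
X = rng.normal(size=(M, d))
# Random matrices stand in for learned projections (illustrative only).
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
perm = np.array([2, 0, 3, 1, 4])

# Without positional encoding: f(X[perm]) equals f(X)[perm].
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)
equivariant = np.allclose(out_perm, out[perm])

# With a position-dependent term the equality no longer holds.
pos = rng.normal(size=(M, d))
with_pos = self_attention(X + pos, Wq, Wk, Wv)
with_pos_perm = self_attention(X[perm] + pos, Wq, Wk, Wv)
broken = not np.allclose(with_pos_perm, with_pos[perm])

print("equivariant without PE:", equivariant)
print("equivariance broken by PE:", broken)
```

This is why a positional encoding is appropriate on the temporal axis, where order matters, but is left out of the set attention applied across agents.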

In motion prediction, attention mechanisms can capture temporal dynamics and social interactions among agents. Temporal attention focuses on sequence dynamics, assigning varying weights to observations for accurate future motion prediction. Social attention complements that by considering interactions among agents, capturing spatial relationships, and accounting for collaborative behaviors.

3.3 FootBots

In designing FootBots, we leverage SAB and MABD blocks drawing inspiration from [11, 10]. FootBots utilizes an encoder-decoder structure with sequential temporal and social attention mechanisms, capturing player-ball dynamics over time. Figure 2 illustrates its key components.

The encoder of FootBots operates by handling input observations denoted as $\mathcal{X}_{0:t}$ through a Feed-Forward Network ($\mathrm{FFN}$), which is supplemented with a positional encoder ($\mathrm{PE}$) to ensure the temporal ordering of the data. To capture the dynamics in both time and social interactions, we utilize SAB. More specifically, we use $\text{SAB}_T$ to encode temporal dynamics and $\text{SAB}_S$ for social encoding. Those blocks are responsible for capturing interactions among the players and the ball during the prior state, resulting in the generation of a representation tensor called context, denoted as $\mathcal{C}$.

Given the prior sequence of sets $\mathcal{X}_{0:t}=(X_{0},\ldots,X_{t})$ with dimensions $[M,t,k]$, we define the encoder operations as follows:

$$\mathcal{C}=\text{SAB}_{S}\left(\text{SAB}_{T}(\text{SAB}_{T}(\textrm{PE}+\text{FFN}(\mathcal{X}_{0:t})))\right), \tag{4}$$

where $\mathcal{C}$ has the dimension $[M,t,d]$, and $d$ is the chosen dimension for the embeddings.
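The sequential decoupling in Eq. (4) can be sketched with plain NumPy by applying the same attention routine along different axes of an $[M,t,d]$ tensor. This is a simplified illustration under several assumptions: a single head, no learned projections, no residual connections or layer normalization, and a random tensor in place of $\textrm{PE}+\text{FFN}(\mathcal{X}_{0:t})$. The values $M=21$ (assuming 20 field players plus the ball) and $t=35$ follow the real dataset, while $d=16$ is arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Z):
    """Self-attention over the second-to-last axis, batched over the first."""
    scores = Z @ np.swapaxes(Z, -1, -2) / np.sqrt(Z.shape[-1])
    return softmax(scores) @ Z

def temporal_attention(Z):
    # Z: [M, t, d]. Each agent attends over its own time steps.
    return self_attention(Z)

def social_attention(Z):
    # Swap to [t, M, d] so that, at each time step, agents attend to each other.
    return np.swapaxes(self_attention(np.swapaxes(Z, 0, 1)), 0, 1)

M, t, d = 21, 35, 16
rng = np.random.default_rng(0)
Z = rng.normal(size=(M, t, d))            # stand-in for PE + FFN(X_{0:t})
C = social_attention(temporal_attention(temporal_attention(Z)))
print(C.shape)                            # (21, 35, 16)

# Attention scores per layer and head: joint space-time vs. decoupled.
joint_scores = (M * t) ** 2               # one attention over all M*t tokens
decoupled_scores = M * t**2 + t * M**2    # one temporal plus one social pass
print(joint_scores, decoupled_scores)     # 540225 41160
```

For these sizes, one temporal pass plus one social pass evaluates roughly 13 times fewer attention scores than a single joint pass over all $M\cdot t$ tokens, which is the efficiency argument behind decoupling the two axes.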

In the decoder, FootBots generates predicted trajectories $\mathcal{O}$ to approximate $\mathcal{X}_{t+1:t+T}$. To this end, it employs a MABD in the temporal axis ($\text{MABD}_T$) and a $\text{SAB}_S$ in the social one. $\text{MABD}_T$ takes into account the output of the encoder $\mathcal{C}$ and the input of the decoder $\mathcal{H}$, which depends on the task at hand: 1) in MP, $\mathcal{H}$ relies on the last $T$ time steps of $\mathcal{C}$; 2) in CMP, $\mathcal{H}$ incorporates the observations of the conditioning agents during the prediction interval $\mathcal{X}^{C}_{t+1:t+T}$, along with the last $T$ time steps of $\mathcal{C}$ for the agents of interest ($P$), $\mathcal{C}^{P}_{t-T:t}$.

Given $T\leq t$, where $T$ represents the desired frames to predict, we outline the decoder operations in the following equation:

$$\mathcal{O}=\mathcal{X}_{t}+\text{FFN}\left(\text{SAB}_{S}\left(\text{MABD}_{T}(\text{MABD}_{T}(\textrm{PE}+\mathcal{H},\mathcal{C}),\mathcal{C})\right)\right), \tag{5}$$

where $\mathcal{X}_{t}$ is the last set of observations of the prior and $\mathcal{H}$ is one input of the $\text{MABD}_T$ operation whose definition depends on the task: $\mathcal{H}=\mathcal{C}_{t-T:t}$ for MP; and $\mathcal{H}=\text{FFN}(\mathcal{X}^{C}_{t+1:t+T})\cup\mathcal{C}^{P}_{t-T:t}$ for CMP. It is worth noting that these relations justify our model’s constraint that $T\leq t$.
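The two definitions of $\mathcal{H}$ can be made concrete with a small shape-level sketch. Everything here is a stand-in: the context and the future observations are random tensors, the FFN is replaced by a single random linear map, and the union is realized by writing the embedded conditioning agents into their original rows, which is one possible reading of the $\cup$ operator rather than a confirmed implementation detail.

```python
import numpy as np

rng = np.random.default_rng(0)
M, t, T, k, d = 21, 35, 25, 3, 16
assert T <= t                              # the last T context steps must exist

C = rng.normal(size=(M, t, d))             # stand-in for the encoder context
X_future = rng.normal(size=(M, T, k))      # stand-in for X_{t+1:t+T}
W_ffn = rng.normal(size=(k, d))            # linear stand-in for the FFN

def decoder_input(C, T, cond_idx=None, X_future=None, W_ffn=None):
    """Build H of shape [M, T, d].
    MP:  H = last T steps of the context for every agent.
    CMP: rows of conditioning agents are replaced by their embedded
         future observations; predicted agents keep the context rows."""
    H = C[:, -T:, :].copy()
    if cond_idx is not None and len(cond_idx) > 0:
        H[cond_idx] = X_future[cond_idx] @ W_ffn
    return H

H_mp = decoder_input(C, T)
ball = np.array([0])                       # CMP1: the ball is the conditioning agent
H_cmp1 = decoder_input(C, T, cond_idx=ball, X_future=X_future, W_ffn=W_ffn)
print(H_mp.shape, H_cmp1.shape)
```

Slicing `C[:, -T:, :]` is only possible when the context holds at least $T$ time steps, which is the origin of the constraint $T\leq t$.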

The dimension of $\mathcal{O}$ is $[M,T,2]$. A skip connection enables residual learning, particularly benefiting early frame precision. In the MP task, FootBots maintains agent permutation equivariance, while in the CMP task, it demonstrates partial equivariance for both conditioning agents ($C$) and agents of interest ($P$), preserving this property within their respective subsets.

For completeness, we also propose a non-social variant of our approach denoted by FootBots NS. In this case, our method substitutes the social $\text{SAB}_S$ attention with a temporal $\text{SAB}_T$ one, omitting social interactions to showcase their importance.

3.4 Loss

The loss function utilized is an Average Displacement Error (ADE), a widely used loss metric in trajectory prediction tasks. It computes the average $l_{2}$-norm between the predicted trajectories and the ground truth (GT) ones as:

$$\text{ADE}=\frac{1}{MT}\sum_{m=1}^{M}\sum_{j=t+1}^{t+T}\left\lVert\hat{\mathbf{x}}^{m}_{j}-\mathbf{x}^{m}_{j}\right\rVert_{2}, \tag{6}$$

where $\hat{\mathbf{x}}^{m}_{j}$ and $\mathbf{x}^{m}_{j}$ represent the predicted position and its corresponding GT position of the $m$-th agent at the $j$-th time step.
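Eq. (6) translates directly into a few lines of NumPy. The sketch below assumes predictions and ground truth are stored as arrays of shape $[M,T,2]$; the function name `ade` is chosen for this example.

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error, Eq. (6), for arrays of shape [M, T, 2]."""
    # l2 norm over the coordinate axis, then mean over agents and time steps.
    return np.linalg.norm(pred - gt, axis=-1).mean()

gt = np.zeros((2, 3, 2))
pred = gt + np.array([3.0, 4.0])   # every prediction is off by the offset (3, 4)
print(ade(pred, gt))               # each displacement has norm 5, so ADE = 5.0
```

When used as a training loss, the same expression would be written in an automatic differentiation framework; the NumPy version above is only meant to make the averaging explicit.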

4 Experimental evaluation

In this section, we present our experimental results on motion prediction and provide a comparison with respect to competing approaches. For quantitative evaluation, we evaluate a subset of agents $\hat{M}$ using four metrics. First, we consider the $\textrm{ADE}_{\hat{M}}$ metric in Eq. (6) restricted to the set $\hat{M}$.

Second, we propose a Final Displacement Error (FDE) that measures the final deviation between the prediction and the corresponding ground truth location as:

$$\textrm{FDE}_{\hat{M}}=\frac{1}{|\hat{M}|}\sum_{m\in\hat{M}}\left\lVert\mathbf{x}^{m}_{t+T}-\hat{\mathbf{x}}^{m}_{t+T}\right\rVert_{2}\,.$$

For completeness, we also propose to compute the Maximum Error (MaxErr) as:

$$\textrm{MaxErr}_{\hat{M}}=\frac{1}{|\hat{M}|}\sum_{m\in\hat{M}}\max_{j\in\{t+1,\ldots,t+T\}}\left\lVert\mathbf{x}^{m}_{j}-\hat{\mathbf{x}}^{m}_{j}\right\rVert_{2}\,,$$

and the missing rate (MR) to show the percentage of predictions whose $l_{2}$ error is greater than 1 meter as:

$$\textrm{MR}_{\hat{M}}=\frac{1}{|\hat{M}|T}\sum_{m\in\hat{M}}\sum_{j=t+1}^{t+T}\mathbb{I}\left[\left\lVert\mathbf{x}^{m}_{j}-\hat{\mathbf{x}}^{m}_{j}\right\rVert_{2}>1\right],$$

with $\mathbb{I}(\cdot)$ an indicator function.
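The three additional metrics can be sketched in NumPy as follows, assuming arrays of shape $[|\hat{M}|,T,2]$ expressed in meters (the 1 meter threshold of MR is only meaningful in metric units). The function names are chosen for this example, and MR is returned as a percentage to match how it is reported in the results table.

```python
import numpy as np

def displacement(pred, gt):
    """Per-agent, per-step l2 error, shape [M_hat, T]."""
    return np.linalg.norm(pred - gt, axis=-1)

def fde(pred, gt):
    # Error at the final predicted step, averaged over agents.
    return displacement(pred, gt)[:, -1].mean()

def max_err(pred, gt):
    # Largest error along each trajectory, averaged over agents.
    return displacement(pred, gt).max(axis=1).mean()

def missing_rate(pred, gt, threshold=1.0):
    # Percentage of predicted positions whose error exceeds the threshold.
    return 100.0 * (displacement(pred, gt) > threshold).mean()

# Two agents, three steps, with hand-picked errors along the x axis.
errors = np.array([[0.5, 3.0, 2.0],
                   [0.2, 0.4, 0.6]])
gt = np.zeros((2, 3, 2))
pred = gt.copy()
pred[..., 0] = errors

print(fde(pred, gt))           # (2.0 + 0.6) / 2, about 1.3
print(max_err(pred, gt))       # (3.0 + 0.6) / 2, about 1.8
print(missing_rate(pred, gt))  # 2 of 6 positions exceed 1 m, about 33.3 %
```

The example shows that MaxErr and FDE differ whenever the worst error occurs before the last frame, as it does for the first agent.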

4.1 Datasets

In this paper, we propose to validate our model on synthetic and real datasets. Next, we provide the most important details for each of them.

Synthetic dataset: Created to aid investigation and model development for both MP and CMP tasks, this dataset contains 10,000 training and 1,500 validation sequences. Each sequence spans 20 time steps: 10 for prior ($t$) and the rest for target ($T$). Five agents, including four players and one ball, compose each sequence. The ball follows a linear trajectory initially, but can randomly change direction at a chosen time step, followed by another linear path. Noise is introduced for trajectory variability, causing slight deviations from linearity. Player behaviors encompass remaining stationary with noise (S), linear trajectories with noise (L), and non-linear paths influenced by the ball’s position as an attractor (A). The number of players for each behavior is randomly determined, all within a bounded square region $[-15,15]\times[-15,15]$ resembling meters in the real world. In this dataset, each agent’s observation is limited to its $xy$ location, leading to $k=2$ according to the problem formulation.
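A generator in the spirit of this description could look like the sketch below. It is a hypothetical reconstruction, not the code used to build the dataset: the speed range, noise scale, attraction gain, initial position range, and probability of a direction change are all invented for illustration, since the text does not specify them.

```python
import numpy as np

def make_sequence(rng, n_steps=20, n_players=4, bound=15.0,
                  speed=0.5, noise=0.05, gain=0.2):
    """Return (trajectories [n_players + 1, n_steps, 2], behaviours).
    Agent 0 is the ball. All numeric defaults are illustrative guesses."""
    def random_velocity():
        return rng.uniform(-speed, speed, size=2)

    # Ball: piecewise-linear path with at most one change of direction.
    turn = rng.integers(1, n_steps) if rng.random() < 0.5 else -1
    pos, vel = rng.uniform(-5.0, 5.0, size=2), random_velocity()
    ball = np.empty((n_steps, 2))
    for s in range(n_steps):
        if s == turn:
            vel = random_velocity()
        ball[s] = pos
        pos = pos + vel

    # Players: stationary (S), linear (L), or attracted by the ball (A).
    behaviours = rng.choice(["S", "L", "A"], size=n_players)
    players = []
    for b in behaviours:
        pos, vel = rng.uniform(-5.0, 5.0, size=2), random_velocity()
        traj = np.empty((n_steps, 2))
        for s in range(n_steps):
            traj[s] = pos
            if b == "L":
                pos = pos + vel
            elif b == "A":
                pos = pos + gain * (ball[s] - pos)   # step towards the ball
        players.append(traj)

    seq = np.stack([ball] + players)
    seq = seq + rng.normal(scale=noise, size=seq.shape)  # trajectory noise
    return np.clip(seq, -bound, bound), behaviours

rng = np.random.default_rng(0)
seq, behaviours = make_sequence(rng)
prior, target = seq[:, :10], seq[:, 10:]   # t = 10 observed, T = 10 to predict
print(seq.shape, prior.shape, target.shape, behaviours)
```

Under this construction, a type A player's future depends on where the ball goes after the observed frames, which is the dependency that the MP versus CMP1 comparison is designed to expose.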

Real dataset: This dataset comprises actual data from 283 matches of LaLiga’s 2022-2023 season, capturing agent motions using advanced computer vision techniques. Each match is divided into sequences representing 9.6 seconds, down-sampled to 6.25 frames per second. Each sequence, consisting of 60 frames, is divided into prior states (35 frames or 5.6 seconds) and target ones (25 frames or 4 seconds). Only trajectories of all 20 field players (excluding goalkeepers) are considered, and agent order is standardized. The dataset is split into 243 matches (82,954 sequences) for training, 20 matches (6,258 sequences) for testing, and 20 matches (7,500 sequences) for validation. Trajectories are normalized to fit within $[-1,1]\times[-1,1]$ by dividing by the largest pitch dimension, and spatial realignment ensures the possession team’s rightward motion on the pitch. When using FootBots and its non-social variant FootBots NS, each agent’s observation includes its 2D position and an associated integer indicating its role: ball, defensive team player, or offensive team player. Therefore, in this case we consider $k=3$ elements. In contrast, when using other baseline models, input is confined to the 2D position, leading to $k=2$.
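The frame arithmetic and the $k=3$ input layout can be verified with a short sketch. The agent count $M=21$ (20 field players plus the ball), the integer role codes, and the helper `split_sequence` are assumptions for this example, and an all-zero array stands in for real tracking data.

```python
import numpy as np

FPS = 6.25
PRIOR_FRAMES, TARGET_FRAMES = 35, 25
n_frames = PRIOR_FRAMES + TARGET_FRAMES        # 60 frames per sequence
duration = n_frames / FPS                      # 9.6 s
prior_seconds = PRIOR_FRAMES / FPS             # 5.6 s
target_seconds = TARGET_FRAMES / FPS           # 4.0 s
total_matches = 243 + 20 + 20                  # train + test + validation = 283

def split_sequence(seq, roles):
    """seq: [M, 60, 2] positions; roles: [M] integer role codes.
    Returns prior [M, 35, 3] (x, y, role) and target [M, 25, 2]."""
    prior_xy = seq[:, :PRIOR_FRAMES]
    target_xy = seq[:, PRIOR_FRAMES:]
    # Repeat each agent's role along time as a third input feature (k = 3).
    role_channel = np.broadcast_to(
        roles[:, None, None], (seq.shape[0], PRIOR_FRAMES, 1))
    prior = np.concatenate([prior_xy, role_channel], axis=-1)
    return prior, target_xy

M = 21
seq = np.zeros((M, n_frames, 2))               # placeholder trajectories
roles = np.array([0] + [1] * 10 + [2] * 10)    # hypothetical role codes
prior, target = split_sequence(seq, roles)
print(duration, prior_seconds, target_seconds, total_matches)
print(prior.shape, target.shape)
```

With $t=35$ and $T=25$, the constraint $T\leq t$ required by the decoder is satisfied on this dataset.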

4.2 Synthetic evaluation

This initial scenario aims to emphasize the motivation and importance of effectively solving the CMP tasks in soccer. By analyzing the synthetic dataset, we can evaluate the significance of the social $\text{SAB}_S$ and its ability to address the specific CMP1 task.

The outcomes of our proposed methods on the synthetic dataset are detailed in Table 1, highlighting the performance metrics for both MP and CMP1 tasks. In terms of $\textrm{ADE}_{P}$, FootBots exhibits a slight advantage over FootBots NS when addressing the MP task. This divergence can be attributed to FootBots’ capacity to anticipate the motions of ball-attracted players (A) by leveraging the anticipated ball position, facilitated by the social $\text{SAB}_S$. However, it is important to note that despite these enhancements, the persistence of high $\textrm{ADE}_{\text{ball}}$ values implies that predictions for type A players may be subject to error propagation. Shifting to the CMP1 task, a notable improvement in predictions is observed.

| Model | Task | Predict ($P$) | $\textrm{ADE}_{P}$ (m) $\downarrow$ | $\textrm{ADE}_{\text{ball}}$ (m) $\downarrow$ |
|---|---|---|---|---|
| FootBots NS | MP | Players+Ball | 0.50 | 1.83 |
| FootBots | MP | Players+Ball | 0.44 | 1.72 |
| FootBots | CMP1 | Players | 0.10 | - |

Table 1: Evaluating our architecture on synthetic data.
Fig. 3: Two examples from the synthetic dataset. The examples serve to visually compare the performance of FootBots NS and FootBots solving MP task, and FootBots solving CMP1. The predictions for different player types (S, L, and A) are evaluated, emphasizing the impact of incorporating social attention and the ball as the conditioning agent.

Nevertheless, for a deeper evaluation, qualitative analysis of example sequences is crucial. Figure 3 provides further insight, showcasing two instances from the validation set, each featuring distinct complexities in ball trajectory. These instances enable differentiation between FootBots NS and FootBots in the MP task, as well as between FootBots in MP and FootBots in CMP1. All three models predict static (type S) and linear (type L) player positions accurately. However, FootBots NS struggles to predict type A player actions because it extrapolates their past trajectories without considering ball-related factors. In the MP task with FootBots, the predictions for type A players correlate with the quality of the ball prediction, so the precision of type A predictions relies significantly on accurate ball prediction. This is exemplified in sequence 1, where the ball trajectory retains linearity throughout the prediction time-frame, leading to almost precise predictions. Conversely, in sequence 2, deviations in ball prediction propagate errors to the predictions of type A players. The robustness of our model in addressing the CMP1 task (conditioned on ball information) is emphasized in the concluding segments of our study. Across the two scenarios presented, the model consistently achieves nearly precise predictions within the CMP1 context.

4.3 Real evaluation

Fig. 4: Qualitative evaluation and comparison on real data. The figure displays the estimated trajectories for approaches Velocity, RNN [20], baller2vec++ [26] and siMLPe [5]; and our solutions FootBots NS and FootBots, by solving the MP task.
| Model | Order | Task | Predict ($P$) | $\textrm{ADE}_{P}$ $\downarrow$ | $\textrm{ADE}_{\text{ball}}$ $\downarrow$ | $\textrm{MaxErr}_{P}$ $\downarrow$ | $\textrm{FDE}_{P}$ $\downarrow$ | $\textrm{MR}_{P}$ (%) $\downarrow$ |
|---|---|---|---|---|---|---|---|---|
| Velocity | None | MP | Players+Ball | 3.27 | 9.39 | 7.34 | 7.27 | 67.50 |
| RNN [20] | None | MP | Players+Ball | 2.67 | 6.91 | 5.56 | 5.43 | 65.54 |
| FootBots NS (Ours) | None | MP | Players+Ball | 2.39 | 6.37 | 5.16 | 5.04 | 60.99 |
| baller2vec++ [26] | Approx-Equivariant | MP | Players+Ball | 2.21 | 6.43 | 4.64 | 4.49 | 60.79 |
| siMLPe [5] | Role-based | MP | Players+Ball | 2.18 | 6.15 | 4.73 | 4.55 | 59.71 |
| FootBots (Ours) | Equivariant | MP | Players+Ball | 2.04 | 5.79 | 4.43 | 4.28 | 57.37 |
| FootBots (Ours) | Equivariant | CMP1 | Players | 1.64 | - | 3.42 | 3.20 | 52.59 |
| FootBots (Ours) | Equivariant | CMP2 | Defensive | 1.38 | - | 2.80 | 2.59 | 47.83 |
| FootBots (Ours) | Equivariant | CMP3 | Offensive | 1.44 | - | 2.98 | 2.78 | 48.26 |
| FootBots (Ours) | Equivariant | CMP4 | Ball | 2.72 | 2.72 | 5.93 | 4.27 | 64.11 |

Table 2: Quantitative evaluation and comparison for MP and CMP tasks on real data. The table provides a comprehensive comparison of our solution with various other approaches in the MP task. All metrics, except $\textrm{MR}_{P}$, are in meters.

In the second evaluation scenario, we analyze the efficacy of FootBots in addressing the MP task by employing a real dataset. We conduct a comparative assessment with various baseline methods to gauge its performance. Additionally, utilizing the same real dataset, we evaluate FootBots’ performance in the CMP tasks. The considered baselines, including the already described FootBots NS, are outlined as follows:

Velocity: We employ velocity extrapolation as a preliminary benchmark, projecting agent predictions linearly based on observed velocity.

RNN: We implement an RNN with LSTM cells, using an encoder for input representation and an MLP decoder for prediction, a method proven effective in prior work [20].

siMLPe: Adapted from human pose prediction, siMLPe treats each joint as an agent, utilizing an MLP-based model with layer normalization and transposition operations for spatial-temporal dynamics using a fixed sequence length [5].

baller2vec++: Adapted from the basketball context, this approach conducts attention simultaneously in both temporal and social dimensions by modelling the attention mask [26].
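As an illustration of the simplest baseline above, the following is a minimal sketch of constant-velocity extrapolation. It assumes that the "observed velocity" is the displacement between the last two observed frames; the function name `velocity_extrapolation` and the array layout are hypothetical and not taken from the original implementation.

```python
import numpy as np

def velocity_extrapolation(observed, horizon):
    """Linearly project each agent forward from its last observed position.

    observed: array of shape (agents, T_obs, 2) with T_obs >= 2.
    horizon:  number of future time steps to predict.
    Returns an array of shape (agents, horizon, 2).
    """
    observed = np.asarray(observed, dtype=float)
    last = observed[:, -1, :]
    # Assumption: velocity is the last per-step displacement.
    velocity = observed[:, -1, :] - observed[:, -2, :]
    steps = np.arange(1, horizon + 1).reshape(1, -1, 1)
    return last[:, None, :] + steps * velocity[:, None, :]

# One agent moving by (1, 2) per step, predicted three steps ahead.
prediction = velocity_extrapolation([[[0.0, 0.0], [1.0, 2.0]]], horizon=3)
```

Because nothing bounds the projection, long horizons can push agents outside the pitch, and each agent is processed independently of the others.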

It is important to note that Velocity, RNN [20], and FootBots NS operate independently on each agent, making the order of the input irrelevant. In contrast, siMLPe [5] is a role-based model and is not equivariant. The authors of baller2vec++ [26] report minimal variation in results when the ordering of agents changes, and describe the model as approximately equivariant. To ensure a fair comparison, the real dataset is ordered based on the initial positional role of each player.
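The ordering property discussed above can be checked numerically. The sketch below uses a generic single-head self-attention over the agent axis, not the FootBots set attention block, and contrasts it with a fixed mixing over agent slots that is only loosely analogous to a role-based model. All names, sizes, and weights are illustrative assumptions.

```python
import numpy as np

def social_self_attention(x, w_q, w_k, w_v):
    """Single-head dot-product attention across agents; x has shape (agents, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n_agents, d = 23, 8  # illustrative sizes, not taken from the dataset
x = rng.normal(size=(n_agents, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
perm = rng.permutation(n_agents)

# Equivariance: permuting the input agents permutes the output the same way.
out = social_self_attention(x, w_q, w_k, w_v)
out_perm = social_self_attention(x[perm], w_q, w_k, w_v)
attention_is_equivariant = np.allclose(out_perm, out[perm])

# A fixed weight per agent slot ties the output to the input ordering.
w_slots = rng.normal(size=(n_agents, n_agents))
slot_out = w_slots @ x
slot_out_perm = w_slots @ x[perm]
slot_mixing_is_equivariant = np.allclose(slot_out_perm, slot_out[perm])
```

The attention layer passes the check because it has no per-slot parameters, whereas the slot-mixing layer fails it, which is why such a model needs a consistent input ordering.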

In Table 2, we provide a comprehensive overview of metrics across methods solving the MP task. FootBots achieves the best result in every metric. The Velocity model shows a significant performance gap, attributed to unconstrained long-term predictions exceeding pitch boundaries. RNN [20] and FootBots NS lack agent interaction, which degrades their performance, mostly in the $\textrm{MaxErr}_{P}$ and $\textrm{FDE}_{P}$ metrics. This emphasizes the significance of social interaction in long-term trajectory prediction. In general, social-aware baselines like siMLPe [5] and baller2vec++ [26] achieve better metrics than non-social methods. Although siMLPe competes well, it is outperformed in the $\textrm{MaxErr}_{P}$ and $\textrm{FDE}_{P}$ metrics by transformer-based approaches like baller2vec++ and FootBots. Moreover, these transformer-based methods can handle variable sequence lengths. However, baller2vec++ encounters difficulties with ball prediction, leading to suboptimal results due to error accumulation. FootBots excels in both ball and player prediction, leverages permutation equivariance, and is more than six times faster than baller2vec++ at inference (73 vs 484 milliseconds), thanks to the decoupled attention. Importantly, in all MP results, $\textrm{ADE}_{\text{ball}}$ consistently exceeds $\textrm{ADE}_{P}$ across methods, which motivates the CMP1 task.
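For readers who want to reproduce this kind of comparison, the displacement metrics reported in Table 2 can be sketched as follows. This is an approximation under stated assumptions: the errors are Euclidean distances in meters, MaxErr is the per-agent maximum over time averaged across agents, and the miss rate counts agents whose final error exceeds a threshold. The threshold value and the exact aggregation used in the paper are not given in this section, so `miss_threshold` is a hypothetical placeholder.

```python
import numpy as np

def displacement_metrics(pred, target, miss_threshold=2.0):
    """Compute ADE, FDE, MaxErr and a miss rate for one sequence.

    pred, target: arrays of shape (agents, T, 2), in meters.
    miss_threshold: hypothetical cutoff in meters (an assumption).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    err = np.linalg.norm(pred - target, axis=-1)  # (agents, T)
    return {
        "ADE": err.mean(),                  # average over agents and time
        "FDE": err[:, -1].mean(),           # error at the final time step
        "MaxErr": err.max(axis=1).mean(),   # assumed: per-agent max, then mean
        "MR": 100.0 * (err[:, -1] > miss_threshold).mean(),  # percent
    }

# Two agents, two time steps, ground truth at the origin.
target = np.zeros((2, 2, 2))
pred = np.array([[[3.0, 4.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 3.0]]])
metrics = displacement_metrics(pred, target)
```

Under these definitions MaxErr can never be smaller than FDE, which matches the ordering of the two columns in Table 2.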

Figure 4 illustrates a particular trajectory prediction example, comparing the baselines in the MP task. Linear predictions from the Velocity baseline produce nonsensical trajectories in certain instances. Both FootBots NS and RNN [20] tend to generate shorter predicted trajectories, underscoring the need to incorporate social interaction in order to understand each player’s future motion. Despite suboptimal ball prediction, siMLPe [5] and baller2vec++ [26] predict players accurately, showing that they capture player interaction dynamics robustly. Additionally, FootBots outperforms the previous baselines both quantitatively and qualitatively, with enhanced ball prediction.

To ensure the generalization of our findings, we conduct a parallel analysis using the real dataset to address all the considered CMP tasks. As with the synthetic dataset, we begin by assessing the model’s performance on CMP1. Our investigation also covers CMP2, which predicts the defensive players’ positions, and CMP3, which predicts the offensive players’ positions. Furthermore, we explore CMP4, which involves ball position prediction.

Quantitative results for FootBots across all CMP tasks are presented in Table 2. The solution for the CMP1 task shows a marked improvement over the MP task, highlighting a strong correlation between player positions and the ball. Prediction accuracy improves further when conditioning on the opposing team and ball locations in CMP2 and CMP3. It is worth noting that predicting offensive team behaviors is more challenging than predicting defensive ones, due to their inherent stochasticity. Moreover, FootBots uses player interactions to provide accurate ball predictions, reducing the $\textrm{ADE}_{\text{ball}}$ metric from 5.79 to 2.72 meters compared to the MP task.
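To put these gains in relative terms, the short calculation below uses only values from Table 2. The helper `relative_reduction` is a hypothetical name introduced for illustration. Note that the MP row averages over players and ball while the CMP1 row covers players only, so the second figure is indicative rather than a like-for-like comparison.

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` improves on `baseline` (lower is better)."""
    return 100.0 * (baseline - value) / baseline

# ADE_ball: MP (5.79 m) versus CMP4 (2.72 m), both taken from Table 2.
ball_gain = relative_reduction(5.79, 2.72)

# ADE_P: MP (2.04 m, players and ball) versus CMP1 (1.64 m, players only).
player_gain = relative_reduction(2.04, 1.64)

print(f"ADE_ball reduction with CMP4: {ball_gain:.1f}%")
print(f"ADE_P reduction with CMP1: {player_gain:.1f}%")
```

Conditioning on the players roughly halves the ball error, while conditioning on the ball lowers the player error by about a fifth.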

Figure 1 illustrates a sample of the real dataset featuring solutions for the MP task and all CMP tasks. This specific instance depicts a scenario with an extended ball trajectory, involving a pass to predict, and swift player motions. Notably, within the MP task, the model struggles to predict both the ball and player positions accurately, owing to the speed of the sequence. However, once conditioning information is added, the predictions improve visibly.

5 Conclusions

In this work, we introduced FootBots, a tailored trajectory prediction model for soccer contexts, and extensively evaluated its performance across diverse scenarios. The comparative analysis demonstrated FootBots’ superior performance over baseline models, showcasing its advantageous equivariance properties. Through synthetic dataset evaluation, FootBots excelled in predicting player positions, particularly when incorporating social attention and ball conditioning (CMP1), highlighting the importance of social interactions and ball incorporation. Extension to real data showcased FootBots’ grasp of defensive strategies (CMP2), improved offensive player predictions (CMP3), and effective player interaction utilization for ball position enhancement (CMP4). Remarkably, CMP4 exhibited significant error reduction compared to the MP task, affirming the effectiveness of using players as conditioning agents to enhance ball prediction accuracy.

References

  • [1] O. B. Sezer, M. U. Gudelek, and A. M. Ozbayoglu, “Financial time series forecasting with deep learning: A systematic literature review: 2005–2019,” Applied Soft Computing, vol. 90, pp. 106181, 2020.
  • [2] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik, “Recurrent network models for human dynamics,” in ICCV, 2015.
  • [3] J. Martinez, M. J. Black, and J. Romero, “On human motion prediction using recurrent neural networks,” in CVPR, 2017.
  • [4] E. Aksan, M. Kaufmann, P. Cao, and O. Hilliges, “A spatio-temporal transformer for 3D human motion prediction,” in 3DV, 2021, pp. 565–574.
  • [5] W. Guo, Y. Du, X. Shen, V. Lepetit, X. Alameda, and F. Moreno-Noguer, “Back to MLP: A simple baseline for human motion prediction,” in WACV, 2023.
  • [6] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human trajectory prediction in crowded spaces,” in CVPR, 2016.
  • [7] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social GAN: Socially acceptable trajectories with generative adversarial networks,” in CVPR, 2018.
  • [8] V. Kosaraju, A. Sadeghian, R. Martín-Martín, I. Reid, H. Rezatofighi, and S. Savarese, “Social-BiGAT: Multimodal trajectory forecasting using Bicycle-GAN and graph attention networks,” in NeurIPS, 2019.
  • [9] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone, “Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data,” in ECCV, 2020.
  • [10] J. Ngiam, V. Vasudevan, B. Caine, Z. Zhang, H. L. Chiang, J. Ling, R. Roelofs, A. Bewley, C. Liu, A. Venugopal, et al., “Scene transformer: A unified architecture for predicting future trajectories of multiple agents,” in ICLR, 2021.
  • [11] R. Girgis, F. Golemo, F. Codevilla, M. Weiss, J. A. D’Souza, S. E. Kahou, F. Heide, and C. Pal, “Latent variable sequential set transformers for joint multi-agent motion prediction,” in ICLR, 2022.
  • [12] H. M. Le, Y. Yue, P. Carr, and P. Lucey, “Coordinated multi-agent imitation learning,” in ICML, 2017.
  • [13] R. A. Yeh, A. G. Schwing, J. Huang, and K. Murphy, “Diverse generation for multi-agent sports games,” in CVPR, 2019.
  • [14] D. Ding and H. H. Huang, “A graph attention based approach for trajectory prediction in multi-agent sports games,” arXiv:2012.10531, 2020.
  • [15] S. Hauri, N. Djuric, V. Radosavljevic, and S. Vucetic, “Multi-modal trajectory prediction of NBA players,” in WACV, 2021.
  • [16] S. Omidshafiei, D. Hennes, M. Garnelo, Z. Wang, A. Recasens, et al., “Multiagent off-screen behavior prediction in football,” Scientific Reports, vol. 12, no. 1, pp. 8638, 2022.
  • [17] M. A. Alcorn and A. Nguyen, “baller2vec: A multi-entity transformer for multi-agent spatiotemporal modeling,” arXiv:2102.03291, 2021.
  • [18] M. Teranishi, K. Tsutsui, K. Takeda, and K. Fujii, “Evaluation of creating scoring opportunities for teammates in soccer via trajectory prediction,” in MLSA, 2022.
  • [19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017.
  • [20] S. Becker, R. Hug, W. Hubner, and M. Arens, “RED: A simple but effective baseline predictor for the TrajNet benchmark,” in ECCVW, 2018.
  • [21] Y. Yuan, X. Weng, Y. Ou, and K. M. Kitani, “AgentFormer: Agent-aware transformers for socio-temporal multi-agent forecasting,” in ICCV, 2021.
  • [22] P. Felsen, P. Lucey, and S. Ganguly, “Where will they go? predicting fine-grained adversarial multi-agent motion using conditional variational autoencoders,” in ECCV, 2018.
  • [23] E. Zhan, S. Zheng, Y. Yue, L. Sha, and P. Lucey, “Generating multi-agent trajectories using programmatic weak supervision,” arXiv:1803.07612, 2018.
  • [24] S. Zheng, Y. Yue, and J. Hobbs, “Generating long-term trajectories using deep hierarchical networks,” in NeurIPS, 2016.
  • [25] C. Sun, P. Karlsson, J. Wu, J. B. Tenenbaum, and K. Murphy, “Stochastic prediction of multi-agent interactions from partial observations,” arXiv:1902.09641, 2019.
  • [26] M. A. Alcorn and A. Nguyen, “baller2vec++: A look-ahead multi-entity transformer for modeling coordinated agents,” arXiv:2104.11980, 2021.
  • [27] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh, “Set transformer: A framework for attention-based permutation-invariant neural networks,” in ICML, 2019.