License: arXiv.org perpetual non-exclusive license
arXiv:2403.10809v1 [cs.RO] 16 Mar 2024

Efficient Trajectory Forecasting and Generation with Conditional Flow Matching

Sean Ye11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT and Matthew C. Gombolay11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT *This work is supported in part by Office of Naval Research (ONR) under grant number N00173-21-1-G009 and the National Science Foundation under grant CNS-2219755,11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPTAll authors are associated with the Institute of Robotics and Intelligent Machines (IRIM), Georgia Institute of Technology, Atlanta, GA, USA.Corresponding Author: Sean Ye, [email protected]
Abstract

Trajectory prediction and generation are vital for autonomous robots navigating dynamic environments. While prior research has typically focused on either prediction or generation, our approach unifies these tasks to provide a versatile framework and achieve state-of-the-art performance. Diffusion models, which are currently state-of-the-art for learned trajectory generation in long-horizon planning and offline reinforcement learning tasks, rely on a computationally intensive iterative sampling process. This slow process impedes the dynamic capabilities of robotic systems. In contrast, we introduce Trajectory Conditional Flow Matching (T-CFM), a novel data-driven approach that utilizes flow matching techniques to learn a solver time-varying vector field for efficient and fast trajectory generation. We demonstrate the effectiveness of T-CFM on three separate tasks: adversarial tracking, real-world aircraft trajectory forecasting, and long-horizon planning. Our model outperforms state-of-the-art baselines with an increase of 35% in predictive accuracy and 142% increase in planning performance. Notably, T-CFM achieves up to 100×\times× speed-up compared to diffusion-based models without sacrificing accuracy, which is crucial for real-time decision making in robotics.

I INTRODUCTION

Robots of the future will require fast and accurate trajectory forecasting techniques to navigate complex, dynamic environments and interact with other agents safely and efficiently. Trajectory forecasting deals with the problem of estimating an agent’s future behavior while trajectory generation deals with planning feasible paths for an agent to follow. These techniques are crucial for various robotics applications, such as autonomous driving [18], multi-robot coordination [24], and social navigation [3]. By generating long-horizon plans and accurately predicting the future trajectories of dynamic agents, robots can make better decisions and adapt to changing conditions in real-time.

Refer to caption
Figure 1: Trajectory Conditional Flow Matching (T-CFM) is our novel trajectory prediction and generation framework. The model is capable of generating trajectories in a single step, outperforming prior generative modeling work by learning a time-varying vector field to sample trajectories.

In recent years, deep learning approaches have achieved impressive results on trajectory forecasting benchmarks by learning complex patterns and distributions from large datasets. Compared to traditional methods such as Kalman Filters [13] and Particle Filters [6], learning-based methods excel at tasks where models of an agent’s behavior are unknown or hard to predict. In particular, generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs) have shown promise in modeling the inherent multimodality and stochasticity in agent behaviors. More recently, denoising diffusion probabilistic models (DDPMs) [20] have emerged as a powerful class of generative models, demonstrating superior performance in sample quality and diversity across various domains. However, a key limitation of diffusion models is their slow sampling speed, which hinders their real-time applicability in robotics. VAEs and GANs are fast but struggle with multimodal sample quality compared to diffusion models.

In this paper, we introduce a novel trajectory forecasting and generation framework that employs flow matching [14], a method that transforms between data distributions using a learned time-varying vector field. Our technique, named Trajectory Conditional Flow Matching (T-CFM), maintains the sample quality of diffusion models while generating samples an order of magnitude faster by circumventing the iterative sampling approach used in diffusion models.

We demonstrate the efficacy of T-CFM on three robotics tasks, shown in Figure 3. In the adversarial tracking scenario, autonomous pursuing agents predict the future trajectories of an adversarial evader. We also showcase T-CFM’s performance on a real-world aircraft trajectory prediction dataset, which has important implications for the development of autonomous aerial robots. Accurate trajectory prediction enables these robots to avoid collisions and coordinate with other aircrafts. Finally, we apply T-CFM to long-horizon planning in complex 2D maze environments [8], demonstrating our model’s ability to generate long-horizon plans, which is crucial for robot navigation.

We show that T-CFM outperforms state-of-the-art baselines, including diffusion models, in terms of predictive accuracy of generated trajectories and sample quality of generated plans. Notably, our approach can generate high-quality trajectory samples with as few as one sampling step, leading to significant speed-ups compared to diffusion models, without sacrificing performance.

Contributions: Our key contributions are three-fold.

  • We propose T-CFM, a novel flow matching framework for conditional trajectory forecasting and generation that is both accurate and efficient. To the best of our knowledge, we are the first to apply flow matching to trajectory prediction and trajectory planning tasks.

  • We demonstrate state-of-the-art performance on three challenging robotics tasks: adversarial tracking, aircraft trajectory forecasting, and long-horizon planning, achieving up to in 35% increase in prediction accuracy and 142% in planning performance.

  • T-CFM achieves significant sampling time speed-ups compared to prior generative modeling approaches, reducing sampling time by up to 100×. Our framework is versatile and can generate high-quality trajectories using as few as one sampling step or multi-step sampling when needed.

II Related Works

Traditionally, trajectory forecasting and trajectory planning have occupied very different and distinct avenues of robotics research. However, with the advent of deep learning, these tasks have become increasingly intertwined. In this section, we review traditional methods for both trajectory forecasting and generation (Section II-A). Then we describe why learning-based approaches can address the key limitations of traditional methods and introduce flow matching and generative modeling.

II-A Symbolic Methods for Trajectory Forecasting & Planning

Target Tracking

Target tracking is a well-studied problem in the robotics community [2], with numerous applications, including surveillance [9], crowd monitoring [21], and wildlife monitoring [7]. Traditional methods for target tracking and trajectory prediction, such as Kalman Filters [4, 13] and Particle Filters [6, 16], have been widely used in various in applications like autonomous navigation, object tracking in video surveillance, robotics, and radar systems. However, their performance degrades when faced with sparse observations, lack of accurate target behavior models, and long prediction horizons [17]. In our work, we address these limitations by leveraging a flow matching-based approach that learns to model the target’s behavior from data, allowing for accurate predictions even in sparse observation settings and over long horizons.

Planning and Navigation

Traditional methods for path planning, such as RRT* and PRM* [12], are widely used in environments with a known representation of the world. However, they can be computationally expensive for large state spaces and require a priori knowledge, which may not always be available or can change dynamically. Learning-based approaches, such as ours, offer the promise of generalization and the ability to provide solutions in dynamic environments. Our approach learns to generate feasible trajectories directly from data, eliminating the need for explicit maps and enabling fast planning in complex, dynamic environments.

II-B Learning-Based Approaches

Supervised Learning Methods for Trajectory Prediction

Recent works in trajectory prediction for various domains, such as aircraft navigation (FlightBERT) [10], social navigation [3], and autonomous driving [18], have employed supervised learning methods. These approaches often utilize autoregressive models and log-likelihood based training to learn predictive models from data. Graph-based Adversarial Modeling with Mutual Information (GrAMMI) [26] is a recent framework that explicitly models a multimodal distribution using a combination of a Gaussian Mixture Model regularized by mutual information. While effective in certain scenarios, these models can struggle with capturing long-horizon multi-modal distributions, which are common in real-world trajectory data. We leverage a generative modeling approach rather than a supervised learning one, enabling more diverse multimodal outputs for accurate prediction in complex real-world scenarios.

Generative Modeling

Generative modeling techniques, which have shown great success in computer vision tasks [20], provide a promising avenue to augment trajectory prediction by learning to model complex, multimodal distributions. Recently, diffusion-based probabilistic models [20] have dominated many generative modeling tasks. Diffusion models generate samples by iteratively denoising a Gaussian distribution, allowing them to capture complex, multimodal distributions. Diffuser [11], a diffusion based approach for learning trajectories, is state of the art in long horizon planning and offline reinforcement learning. Constrained Agent-based Diffusion for Enhanced Multi-Target Tracking (CADENCE) [25], similarly extends the Diffuser framework for Adversarial Tracking and modeling multi-agent behaviors. The main drawback of diffusion models is their iterative denoising process, which can be computationally slow, limiting their real-time applicability.

As an efficient alternative to diffusion models, flow matching techniques [14, 22] learn a generative model using ordinary differential equations (ODEs) instead of stochastic differential equations (SDEs). This formulation allows for faster sampling while maintaining the ability to model complex distributions. To the best of our knowledge, our work is the first to apply flow matching techniques for learning trajectories for prediction and planning tasks.

III Preliminaries

Trajectories play a crucial role in various robotics domains as they represent the behavior and evolution of an agent over time. Formally, we define a trajectory τ𝜏\tauitalic_τ as a sequence of states s1,s2,,sTsuperscript𝑠1superscript𝑠2superscript𝑠𝑇{s^{1},s^{2},\ldots,s^{T}}italic_s start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , … , italic_s start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT, where st𝒮superscript𝑠𝑡𝒮s^{t}\in\mathcal{S}italic_s start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∈ caligraphic_S represents the state of the agent at time horizon step t𝑡titalic_t, and 𝒮𝒮\mathcal{S}caligraphic_S is the state space. In some cases, trajectories may also include actions, represented as τ={(s1,a1),(s2,a2),,(sT,aT)}𝜏superscript𝑠1superscript𝑎1superscript𝑠2superscript𝑎2superscript𝑠𝑇superscript𝑎𝑇\tau=\{(s^{1},a^{1}),(s^{2},a^{2}),\ldots,(s^{T},a^{T})\}italic_τ = { ( italic_s start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ) , ( italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) , … , ( italic_s start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) }, where at𝒜superscript𝑎𝑡𝒜a^{t}\in\mathcal{A}italic_a start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∈ caligraphic_A is the action taken by the agent at time step t𝑡titalic_t, and 𝒜𝒜\mathcal{A}caligraphic_A is the action space.

Refer to caption
Figure 2: Overview of Trajectory Conditional Flow Matching. The flow matching formulation defines intermediate trajectories as a linear combination between the prior noise distribution (τ0subscript𝜏0\tau_{0}italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT) and data distribution (τ1)\tau_{1})italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ). The sampling procedure then utilizes the learned flow field generated by the model to create samples.

By modeling trajectories in this general form, we can develop methods for trajectory forecasting and planning that are applicable across different domains. In trajectory forecasting, the goal is to predict an agent’s future trajectory given its past states and additional context information, c𝑐citalic_c. This can be formalized as learning a conditional distribution pθ(τt+1:T|τ1:t,c)subscript𝑝𝜃conditionalsuperscript𝜏:𝑡1𝑇superscript𝜏:1𝑡𝑐p_{\theta}(\tau^{t+1:T}|\tau^{1:t},c)italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_τ start_POSTSUPERSCRIPT italic_t + 1 : italic_T end_POSTSUPERSCRIPT | italic_τ start_POSTSUPERSCRIPT 1 : italic_t end_POSTSUPERSCRIPT , italic_c ), where τ1:tsuperscript𝜏:1𝑡\tau^{1:t}italic_τ start_POSTSUPERSCRIPT 1 : italic_t end_POSTSUPERSCRIPT represents the observed trajectory up to time t𝑡titalic_t, τt+1:Tsuperscript𝜏:𝑡1𝑇\tau^{t+1:T}italic_τ start_POSTSUPERSCRIPT italic_t + 1 : italic_T end_POSTSUPERSCRIPT represents the future trajectory to be predicted, and c𝑐citalic_c represents any additional context. Similarly, in adversarial tracking, we can use the same framework to predict an adversary’s (τ)𝜏(\tau)( italic_τ ) future trajectory given prior detection or environmental information, c𝑐citalic_c.

In long-horizon planning, the goal is to generate a trajectory that leads an agent from an initial state to a goal state. This can be formalized as drawing a trajectory sample from pθ(τ|c)subscript𝑝𝜃conditional𝜏𝑐p_{\theta}(\tau|c)italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_τ | italic_c ), where the context information, c𝑐citalic_c are the start and goal states. Rather than a trajectory optimization problem, this formulation allows us to view the problem as a conditional generation task, where we sample trajectories from a learned distribution conditioned on the desired start and goal states.

IV Method

In this section, we describe our flow matching formulation and how we model our problem. The goal primary goal is to generate trajectories τ𝜏\tauitalic_τ given the conditioning factor c𝑐citalic_c. Trajectories are defined simply as a sequence of states or a sequence of states and actions.

IV-A Flow Matching Formulation

The goal of flow matching, similar to diffusion models, is to learn a process that can generate samples through an iterative process that lies in the data distribution. To do this, we model the starting random Gaussian noise distribution as q(τ0)𝑞subscript𝜏0q(\tau_{0})italic_q ( italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) and the trajectory data distribution as q(τ1)𝑞subscript𝜏1q(\tau_{1})italic_q ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ). We refer to these distributions as q0,q1subscript𝑞0subscript𝑞1q_{0},q_{1}italic_q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT where the generative modeling task is to transform q0subscript𝑞0q_{0}italic_q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT to q1subscript𝑞1q_{1}italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT.

To learn a model that can transform q0subscript𝑞0q_{0}italic_q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT to q1subscript𝑞1q_{1}italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, we model a time-varying vector field u:[0,1]×dd:𝑢01superscript𝑑superscript𝑑u:[0,1]\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}italic_u : [ 0 , 1 ] × blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT and a probability path p:[0,1]×dd+:𝑝01superscript𝑑superscriptlimit-from𝑑p:[0,1]\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d+}italic_p : [ 0 , 1 ] × blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_d + end_POSTSUPERSCRIPT. The vector field u𝑢uitalic_u is defined by an ordinary differential equation (Equation 1).

dτ=ut(τ)dt𝑑𝜏subscript𝑢𝑡𝜏𝑑𝑡d\tau=u_{t}(\tau)dtitalic_d italic_τ = italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ ) italic_d italic_t (1)

Intuitively, the vector field defines the direction and magnitude to push each sample such that a sample from q0subscript𝑞0q_{0}italic_q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT arrives at its corresponding location in q1subscript𝑞1q_{1}italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT by following the probability path p𝑝pitalic_p over time.

We aim to approximate the true vector field u𝑢uitalic_u using a neural network represented by vθ(t,τ)subscript𝑣𝜃𝑡𝜏v_{\theta}(t,\tau)italic_v start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t , italic_τ ), where vθ(t,τ)subscript𝑣𝜃𝑡𝜏v_{\theta}(t,\tau)italic_v start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t , italic_τ ) defines a time-dependent vector field parameterized by weights, θ𝜃\thetaitalic_θ. The flow matching objective is to minimize the difference between the predicted vector field vθ(t,τ)subscript𝑣𝜃𝑡𝜏v_{\theta}(t,\tau)italic_v start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t , italic_τ ) and the true vector field ut(τ)subscript𝑢𝑡𝜏u_{t}(\tau)italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ ), as expressed in Equation 2.

minθ𝔼t,τpt(τ)vθ(t,τ)ut(τ)2subscript𝜃subscript𝔼similar-to𝑡𝜏subscript𝑝𝑡𝜏superscriptnormsubscript𝑣𝜃𝑡𝜏subscript𝑢𝑡𝜏2\min_{\theta}\mathbb{E}_{t,\tau\sim p_{t}(\tau)}\left\|v_{\theta}(t,\tau)-u_{t% }(\tau)\right\|^{2}roman_min start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t , italic_τ ∼ italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ ) end_POSTSUBSCRIPT ∥ italic_v start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t , italic_τ ) - italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (2)

However, this objective is intractable as there is no closed form representation for the true vector field ut(τ)subscript𝑢𝑡𝜏u_{t}(\tau)italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ ). Instead, prior work [14, 22, 1] has proposed to estimate the conditional form of the vector field ut(τ|z)subscript𝑢𝑡conditional𝜏𝑧u_{t}(\tau|z)italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ | italic_z ) which is conditioned on a random variable z𝑧zitalic_z. In our work, we use the formulation where q(z)=q(τ0)q(τ1)𝑞𝑧𝑞subscript𝜏0𝑞subscript𝜏1q(z)=q(\tau_{0})q(\tau_{1})italic_q ( italic_z ) = italic_q ( italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) italic_q ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ), meaning z𝑧zitalic_z captures the starting and ending points of the trajectory.

We assume a Gaussian flow between τ0subscript𝜏0\tau_{0}italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and τ1subscript𝜏1\tau_{1}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT with standard deviation σ𝜎\sigmaitalic_σ and model the probability path pt(τ|z)subscript𝑝𝑡conditional𝜏𝑧p_{t}(\tau|z)italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ | italic_z ) and vector field ut(τ|z)subscript𝑢𝑡conditional𝜏𝑧u_{t}(\tau|z)italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ | italic_z ) as shown in Equations IV-A. Equation 3a defines the probability path as a Gaussian distribution centered at a linear interpolation between τ0subscript𝜏0\tau_{0}italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and τ1subscript𝜏1\tau_{1}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT at time t𝑡titalic_t. In the top portion of Figure 2, we show a visualization of the linear interpolation used to generate intermediate trajectories between τ0subscript𝜏0\tau_{0}italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and τ1subscript𝜏1\tau_{1}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT. Equation 3b defines the target vector field simply as the difference vector pointing from the starting point τ0subscript𝜏0\tau_{0}italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT to the end point τ1subscript𝜏1\tau_{1}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT.

pt(τ|z)=𝒩(τ|tτ1+(1t)τ0,σ2)subscript𝑝𝑡conditional𝜏𝑧𝒩conditional𝜏𝑡subscript𝜏11𝑡subscript𝜏0superscript𝜎2p_{t}(\tau|z)=\mathcal{N}\left(\tau|t\tau_{1}+(1-t)\tau_{0},\sigma^{2}\right)italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ | italic_z ) = caligraphic_N ( italic_τ | italic_t italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + ( 1 - italic_t ) italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) (3a)
ut(τ|z)=τ1τ0.subscript𝑢𝑡conditional𝜏𝑧subscript𝜏1subscript𝜏0u_{t}(\tau|z)=\tau_{1}-\tau_{0}.italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ | italic_z ) = italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT . (3b)

With this formulation, we now have a computable target vector field that we can regress our neural network to. Algorithm 1 summarizes the training steps:

  1. 1.

    Draw a starting trajectory τ0subscript𝜏0\tau_{0}italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT from the Gaussian noise distribution q(τ0)𝑞subscript𝜏0q(\tau_{0})italic_q ( italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) and a random timestep t𝑡titalic_t from a uniform distribution (Lines 3 - 4).

  2. 2.

    Draw a ground truth end trajectory τ1subscript𝜏1\tau_{1}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and conditioning factor c𝑐citalic_c from the dataset (Line 5).

  3. 3.

    Compute the intermediate trajectory τ𝜏\tauitalic_τ at time t𝑡titalic_t by linearly interpolating between τ0subscript𝜏0\tau_{0}italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and τ1subscript𝜏1\tau_{1}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT (Line 6).

  4. 4.

    Match the vector field vθ(t,τ)subscript𝑣𝜃𝑡𝜏v_{\theta}(t,\tau)italic_v start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t , italic_τ ) predicted by the neural network to the target vector field ut(τ|z)=τ1τ0subscript𝑢𝑡conditional𝜏𝑧subscript𝜏1subscript𝜏0u_{t}(\tau|z)=\tau_{1}-\tau_{0}italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ | italic_z ) = italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT (Lines 7 - 9).

By repeating these steps and updating the neural network weights to minimize the difference between the predicted and target vector fields, the model learns to approximate the true time-dependent vector field that transforms samples from the starting noise distribution to the data distribution.

Algorithm 1 Conditional Flow Matching Training
1:Dataset 𝒟𝒟\mathcal{D}caligraphic_D, computable ut(x|z)subscript𝑢𝑡conditional𝑥𝑧u_{t}(x|z)italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x | italic_z ) and network vθ(t,τ,c)subscript𝑣𝜃𝑡𝜏𝑐v_{\theta}(t,\tau,c)italic_v start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t , italic_τ , italic_c ).
2:while Training do
3:     τ0q(τ0)similar-tosubscript𝜏0𝑞subscript𝜏0\tau_{0}\sim q(\tau_{0})italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_q ( italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ); \triangleright Draw source trajectory
4:     tU(0,1)similar-to𝑡𝑈01t\sim U(0,1)italic_t ∼ italic_U ( 0 , 1 ) \triangleright Draw timestep
5:     τ1,c𝒟similar-tosubscript𝜏1𝑐𝒟\tau_{1},c\sim\mathcal{D}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_c ∼ caligraphic_D \triangleright Draw target trajectory and conditioning
6:     τpt(τ|z)=𝒩(τ|tτ1+(1t)τ0,σ2)similar-to𝜏subscript𝑝𝑡conditional𝜏𝑧𝒩conditional𝜏𝑡subscript𝜏11𝑡subscript𝜏0superscript𝜎2\tau\sim p_{t}(\tau|z)=\mathcal{N}\left(\tau|t\tau_{1}+(1-t)\tau_{0},\sigma^{2% }\right)italic_τ ∼ italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ | italic_z ) = caligraphic_N ( italic_τ | italic_t italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + ( 1 - italic_t ) italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) \triangleright Eq. 3a
7:     ut(τ|z)=τ1τ0subscript𝑢𝑡conditional𝜏𝑧subscript𝜏1subscript𝜏0u_{t}(\tau|z)=\tau_{1}-\tau_{0}italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ | italic_z ) = italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT \triangleright Eq. 3b
8:     CFM(θ)=||vθ(t,τ,c)ut(τ|z)||2\mathcal{L}_{CFM}(\theta)=||v_{\theta}(t,\tau,c)-u_{t}(\tau|z)||^{2}caligraphic_L start_POSTSUBSCRIPT italic_C italic_F italic_M end_POSTSUBSCRIPT ( italic_θ ) = | | italic_v start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t , italic_τ , italic_c ) - italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ | italic_z ) | | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT \triangleright Loss
9:     θ=θ+αθCFM(θ)𝜃𝜃𝛼subscript𝜃subscript𝐶𝐹𝑀𝜃\theta=\theta+\alpha\nabla_{\theta}\mathcal{L}_{CFM}(\theta)italic_θ = italic_θ + italic_α ∇ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_C italic_F italic_M end_POSTSUBSCRIPT ( italic_θ ) \triangleright Update Model
10:end while

The neural network model used to parameterize vθ(t,τ,c)subscript𝑣𝜃𝑡𝜏𝑐v_{\theta}(t,\tau,c)italic_v start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t , italic_τ , italic_c ) is a 1D Convolutional Temporal U-Net based on prior diffusion work [11, 25]. 1D convolutions slide over the time dimension of the input trajectory τ𝜏\tauitalic_τ, capturing temporal patterns and dependencies without being autoregressive. This allows for efficient parallel processing of the entire trajectory. The model also incorporates Feature-Wise Linear Modulation (FiLM) Layers [15] to condition the model with relevant context information, c𝑐citalic_c. For each domain, we will describe the context vector (c𝑐citalic_c) used in Section V. By using the same base architecture as prior diffusion work, we demonstrate that our training methodology generates better models irrespective of model parameter count and architecture.

IV-B Sampling

Given a trained flow model vθ(t,τ,c)subscript𝑣𝜃𝑡𝜏𝑐v_{\theta}(t,\tau,c)italic_v start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t , italic_τ , italic_c ), the sampling procedure utilizes an ODE solver to recover the solution to Equation 1. We can denote the solution of the ODE with ϕt(τ)subscriptitalic-ϕ𝑡𝜏\phi_{t}(\tau)italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ ), where ϕ0(τ)=τsubscriptitalic-ϕ0𝜏𝜏\phi_{0}(\tau)=\tauitalic_ϕ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_τ ) = italic_τ and ϕt(τ)subscriptitalic-ϕ𝑡𝜏\phi_{t}(\tau)italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_τ ) is the transformation of our trajectory τ𝜏\tauitalic_τ transported along the vector field from time 00 to time t𝑡titalic_t. In Algorithm 2, we show the sampling procedure using the Euler method (Line 9) but any off the shelf ODE solver can be used. In our experiments, we choose to use the Euler sampling method as the number of sampling steps is easily adjustable. The bottom portion of Figure 2 shows how the sampling procedure moves from a prior noisy trajectory τ0subscript𝜏0\tau_{0}italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT to a trajectory that lies within the data distribution τ1subscript𝜏1\tau_{1}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT.

One key difference between the trajectory generation and planning tasks is the planning task requires constraints on the sampled trajectory. We provide a formulation to constrain the generated trajectory τ𝜏\tauitalic_τ to start at the current robot state and end at the desired goal state. For each sampling step, we set these states of the trajectory rather than interpolate from noise (Alg 2, Line 6-7), where the horizon of the trajectory is denoted hhitalic_h. In this formulation, we allow the model to infill the trajectory to generate a cohesive plan.

Algorithm 2 Euler Sampling
1:Samplable q(τ0)=𝒩(0,I)𝑞subscript𝜏0𝒩0𝐼q(\tau_{0})=\mathcal{N}(0,I)italic_q ( italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = caligraphic_N ( 0 , italic_I ), trained flow network vθ(t,τ,c)subscript𝑣𝜃𝑡𝜏𝑐v_{\theta}(t,\tau,c)italic_v start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t , italic_τ , italic_c ), number of sampling timesteps N𝑁Nitalic_N
2:Optional: Start State s0superscript𝑠0s^{0}italic_s start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT, End State shsuperscript𝑠s^{h}italic_s start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT
3:τ0q(τ0)similar-tosubscript𝜏0𝑞subscript𝜏0\tau_{0}\sim q(\tau_{0})italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_q ( italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) \triangleright Sample τ0subscript𝜏0\tau_{0}italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT
4:for t=1,,N𝑡1𝑁t=1,...,Nitalic_t = 1 , … , italic_N do
5:     if Planning then
6:         τt/N0s0superscriptsubscript𝜏𝑡𝑁0superscript𝑠0\tau_{t/N}^{0}\leftarrow s^{0}italic_τ start_POSTSUBSCRIPT italic_t / italic_N end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ← italic_s start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT \triangleright Set Start State
7:         τt/Nhshsuperscriptsubscript𝜏𝑡𝑁superscript𝑠\tau_{t/N}^{h}\leftarrow s^{h}italic_τ start_POSTSUBSCRIPT italic_t / italic_N end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ← italic_s start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT \triangleright Set Goal State
8:     end if
9:     τ(t+1)/Nτt/N+1Nvθ(t,τt/N,c)subscript𝜏𝑡1𝑁subscript𝜏𝑡𝑁1𝑁subscript𝑣𝜃𝑡subscript𝜏𝑡𝑁𝑐\tau_{(t+1)/N}\leftarrow\tau_{t/N}+\frac{1}{N}v_{\theta}(t,\tau_{t/N},c)italic_τ start_POSTSUBSCRIPT ( italic_t + 1 ) / italic_N end_POSTSUBSCRIPT ← italic_τ start_POSTSUBSCRIPT italic_t / italic_N end_POSTSUBSCRIPT + divide start_ARG 1 end_ARG start_ARG italic_N end_ARG italic_v start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t , italic_τ start_POSTSUBSCRIPT italic_t / italic_N end_POSTSUBSCRIPT , italic_c )
10:end for

The key feature of parameterizing the probability flow and vector field through Equation IV-A is that it enables our model to learn straight flows as compared to diffusion models. In diffusion models, the sampling process involves gradually denoising a Gaussian noise sample over many steps, following a complex path in the data space. This typically requires a large number of sampling steps to generate high-quality samples. In contrast, our flow matching approach learns a direct, straight path from the starting noise distribution to the target data distribution by modeling the probability path as a linear interpolation between the starting and ending points (Equation IV-A), encouraging the model to find the most efficient trajectory that matches the true data distribution.

Intuitively, the straight flows learned by our model can be thought of as a shortcut from the noise distribution to the data distribution. Instead of taking a meandering path through the data space, the model learns to follow a direct route guided by the target vector field, enabling our method to reduce the number of intermediate steps needed to generate high-quality trajectories. This straight path allows for faster sampling with fewer steps.

V Evaluation and Domains

We test our model in three different tasks and domains: 1) Adversarial Tracking 2) Trajectory Forecasting, and 3) Long-Horizon Planning. These domains test our model’s capability of generating accurate multimodal trajectory predictions and plans for robots to use. Visualizations for the training data and domains are shown in Figure 3.

Refer to caption
Figure 3: Trajectory Forecasting and Planning Domains: Our T-CFM framework is applicable many trajectory modeling tasks, with Adversarial Tracking, Trajectory Forecasting, and Long-Horizon Planning domains shown here.

V-A Adversarial Tracking

Adversarial tracking aims to predict an adversary’s future trajectory τ𝜏\tauitalic_τ given past historical information, c𝑐citalic_c. These domains are challenging due to the adversary’s potential multiple strategies and the observers’ often incomplete or sparse data. We assess our flow-matching tracking models using the Prison Escape and Smuggler scenarios, as introduced in previous work [26].

Refer to caption
(a) ADE vs Prediction Time-Horizon
Refer to caption
(b) ADE vs Total Sampling Steps
Figure 4: Comparison our model with the diffusion-based CADENCE model. Our method achieves better ADE on the entire prediction horizon (left) while also maintaining performance when the number of sampling steps is lowered (right).

The Prisoner Escape and Narco Traffic Interdiction simulations share similar pursuit-evasion dynamics, with tracking agents collaborating to locate and apprehend an adversary attempting to reach predetermined hideouts. The agents face the challenge of operating in large environments with sparse detections of the opponent. Key differences between the domains include the type of fog-of-war, agent capturing dynamics, and destination types. We refer the reader to prior work for more details [26].

For both scenarios, we utilize open-sourced datasets from prior work [26]. The Prison Escape scenario consists of three datasets (Prisoner-Low, Prisoner-Medium, Prisoner-High) with opponent detection rates of 12.9%percent12.912.9\%12.9 %, 44.0%percent44.044.0\%44.0 %, and 63.1%percent63.163.1\%63.1 %, respectively. The Narco Interdiction scenario uses two datasets with opponent detection rates of 13.8%percent13.813.8\%13.8 % and 31.5%percent31.531.5\%31.5 %, adjusted by modifying the pursuit agents’ detection radius. We evaluate our models using Average Displacement Error (ADE), which computes the average l2subscript𝑙2l_{2}italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT distance between each sampled trajectory and the ground truth trajectory over all timesteps.

V-B Aircraft Trajectory Forecasting

To demonstrate the capabilities of our model on real data, we retrieved two years of data for a single Cessna aircraft from the OpenSky database [19]. Individual trajectories were extracted from the dataset, resulting in a total of 474 trajectories and a train/val/test split of 80/10/10% was used. The Cessna was chosen to constrain the range of trajectories while maintaining significant variability, allowing for testing the multimodal performance of our algorithm. The goal is to predict the future trajectory τ𝜏\tauitalic_τ given the 5-minute history of past states c𝑐citalic_c, which we use as our context vector. We evaluate the forecasting performance using two common metrics: mean absolute error (MAE) and root mean square error (RMSE) for longitude, latitude, and altitude.

V-C Long-Horizon Imitation Learning - Maze2D

Learning to plan for long horizons is crucial for robots to navigate autonomously in complex domains. The performance of our models in long-horizon planning is evaluated using the Maze2D environments [8]. In this task, the agent must traverse from a starting location to a goal location. The algorithm is tested on three maps of increasing difficulty: U-Maze, Medium, and Large. Following prior work, the performance is reported in terms of score, which represents the agent’s success in reaching the final goal. The score is normalized between 0 and 100 based on an expert policy.

Two different evaluations are performed: single-task and multi-task. In the single-task evaluation, the goal location remains constant while in the multi-task setting, the goal location is randomly selected at the beginning of each episode. Training data consists of successful trajectories between randomly selected start and end goals.

VI Results and Discussions

This section presents results and analysis for three tasks: Adversarial Tracking, Trajectory Forecasting, and Long-Horizon Planning. Three models were trained for each task using different random seeds.

VI-A Adversarial Tracking

Our approach is benchmarked against several state-estimation baselines including 1) VRNN [5], 2) GRaMMI [26] and 3) CADENCE [25].

VI-A1 Tracking Capabilities of Flow Matching

Prediction Horizon
0 min 30 min 60 min 90 min 120 min
P-Low VRNN 0.106 0.093 0.119 0.146 0.177
GrAMMI 0.060 0.080 0.110 0.154 0.163
CADENCE 0.057 0.077 0.100 0.127 0.154
T-CFM (Ours) 0.055 0.076 0.101 0.128 0.153
P-Med VRNN 0.172 0.086 0.110 0.144 0.167
GrAMMI 0.049 0.077 0.110 0.146 0.167
CADENCE 0.046 0.076 0.103 0.129 0.153
T-CFM (Ours) 0.030 0.058 0.088 0.118 0.146
P-High VRNN 0.105 0.059 0.100 0.117 0.145
GrAMMI 0.015 0.056 0.092 0.122 0.162
CADENCE 0.017 0.054 0.078 0.099 0.118
T-CFM (Ours) 0.018 0.044 0.067 0.089 0.110
NTI-Low VRNN 0.147 0.156 0.186 0.187 0.203
GrAMMI 0.121 0.144 0.181 0.183 0.193
CADENCE 0.112 0.123 0.135 0.148 0.160
T-CFM (Ours) 0.110 0.121 0.133 0.145 0.156
NTI-High VRNN 0.138 0.153 0.183 0.179 0.185
GrAMMI 0.131 0.163 0.174 0.175 0.184
CADENCE 0.113 0.124 0.138 0.152 0.163
T-CFM (Ours) 0.112 0.125 0.137 0.150 0.162
TABLE I: Average Displacement Error Results for three Prisoner Escape (P-low, P-med, P-high) and two Narco Traffic Interdiction (NTI-low, NTI-high) Datasets. Bolded values represent the best performing model.

We report our results on the three Prisoner Escape datasets and two Narco Interdiction datasets in Table I. We show that T-CFM outperform or matches the prior baselines on all prediction horizons with the greatest advantages on the Prisoner-Medium and Prisoner-High datasets, showcasing a 17% and 12% increase in predictive accuracy respectively. We hypothesize that the flow-matching models are able to better incorporate the dense detection history information than the diffusion models because the flow field is deterministic and does not include adding an additional noise component. This may benefit the flow matching models to generate more confident and correct trajectories as compared to the diffusion models.

We also show the ADE for the entire prediction horizon on the Prisoner-Medium dataset in Figure 3(a). The VRNN and GRaMMI models are not shown as they do not predict full trajectories and also are not as competitive as the diffusion baseline. We find that our flow matching model reduces the ADE over all time horizons and has a tighter standard deviation than the diffusion model. Furthermore, we observe a performance dip in CADENCE between the first and second prediction timesteps, characterized by the sudden increase in ADE. This occurs because CADENCE employs an inpainting formulation that sets the first timestep to the detected location, if available. Consequently, this formulation introduces a risk of discontinuities in the diffusion tracks. Our results show that our flow matching model does not encounter the same issue and can outperform the diffusion-based model even without an explicit inpainting formulation.

VI-A2 Sampling Speed Analysis

We analyze the accuracy of our model compared to the diffusion model by reducing the total number of sampling steps N𝑁Nitalic_N. In diffusion models, sampling steps refer to denoising steps, while in our method, they refer to Euler sampling steps. Both formulations require a neural network function call at each sampling step, making the number of sampling steps the primary bottleneck in reducing overall sampling time as the underlying neural network architecture is the same.

Lon MAE Lat MAE Alt MAE
00 15151515 30303030 00 15151515 30303030 00 15151515 30303030
FlightBERT 0.036 0.127 0.164 0.024 0.087 0.102 390.1 1060.8 1014.1
T-CFM (Ours) 0.010 0.098 0.130 0.006 0.067 0.075 145.3 853.3 782.6
Lon RMSE Lat RMSE Alt RMSE
00 15151515 30303030 00 15151515 30303030 00 15151515 30303030
FlightBERT 0.057 0.188 0.267 0.035 0.122 0.161 509.9 1430.6 1375.1
T-CFM (Ours) 0.014 0.160 0.234 0.009 0.107 0.141 205.5 1242.6 1135.4
TABLE II: Aircraft Trajectory Forecasting: T-CFM achieves lower MAE and RMSE on Latitude, Longitude, and Altitude.

We find that our flow-based model can generate high quality samples with just a single sampling step (Figure 3(b)). This is due to the difference between flow matching and diffusion objectives. Flow matching enforces a straight probability flow between the starting distribution q(τ0)𝑞subscript𝜏0q(\tau_{0})italic_q ( italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) and the ending distribution q(τ1)𝑞subscript𝜏1q(\tau_{1})italic_q ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ). Consequently, while diffusion models may need to adjust the sample direction during denoising, the flow matching framework learns a good initial estimate of how to move samples from the noisy distribution, enabling sample generation without multiple steps.

VI-B Aircraft Trajectory Forecasting

The trajectory forecasting task tests our model’s generative capabilities on real-world data rather than simulated data. We compare our method against FlightBERT [10], a modern transformer-based framework built specifically for aircraft trajectory forecasting. We modify FlightBERT’s attention mechanism, as the original framework assumed access to aircraft velocities. We also train with a negative log-likelihood loss to better model the variance in our dataset.

Table II shows our model’s performance as compared to FlightBERT. We find that our method outperforms FlightBERT on all metrics with an average improvement of 35.4%percent35.435.4\%35.4 % over all metrics. We hypothesize two main reasons that our method outperforms FlightBERT. First, our method is not autoregressive generates the whole trajectory at once. This provides an advantage as errors may not accumulate over the prediction time horizon. FlightBERT was only tested for shorter horizon predictions. Meanwhile, we are interested in longer horizon predictions of up to 30 minutes compared to the 5 minute horizon for the dataset in FlightBERT. Second, we hypothesize that the multimodal capabilities of our model is important for our flight trajectory dataset. Unlike commercial flights, the behavior of the Cessna aircraft does not travel in straight paths and consists of multiple heading changes throughout its path. We show that T-CFM better models these diverse trajectories than prior work.

VI-C Long-Horizon Planning through Imitation Learning

We compare how well our models perform against Diffuser [11], the current state-of-the-art method for learning how to plan solely from data on the Maze2D task (Table III). We find that with just a single sampling step, our method significantly outperforms the diffusion models, achieving a 142% increase in score. Similar to our analysis in Adversarial Tracking, the linear probability flows allows us to immediately infer high quality samples. Additionally, we provide a visualization of the sampling procedure with just two sampling timesteps in Figure 5. Here we show that Diffuser produces a plan that intersects with the wall, as it is requires a large number of sampling steps to produce coherent plans. Meanwhile our method plans a collision free path. Additionally, the middle trajectory τ1/2subscript𝜏12\tau_{1/2}italic_τ start_POSTSUBSCRIPT 1 / 2 end_POSTSUBSCRIPT shows more coherence and has less spread in T-CFM than Diffuser. This supports the hypothesis that our T-CFM’s flow field is more efficient at transforming noisy trajectories into realistic ones.

Refer to caption
Figure 5: Visualization of Sampling Procedure between Diffuser (top) and T-CFM (bottom) in Maze2D-Medium. In just two sampling steps, we show that T-CFM can successfully plan a path between the start and end unlike Diffuser.

While T-CFM outperforms Diffuser in the U-Maze and Medium mazes with more sampling steps, Diffuser outperforms our model in the Large maze. T-CFM occasionally generates good plans but sometimes produce paths that collide with walls, lowering the overall score. Sampling a single trajectory from noise increases the chance of generating inaccessible plans compared to Diffuser. This may be because linear flows from flow matching struggle to correct certain noise initializations. In contrast, the diffusion model’s sampling procedure is less dependent on initial noise, allowing it to reason better in the larger domain.

Environment N=1 N=256
Diffuser T-CFM Diffuser T-CFM
(Ours) (Ours)
Maze2D U-Maze 50.7±plus-or-minus\pm±6.7 106.7±plus-or-minus\pm±2.7 112.5±plus-or-minus\pm±11.2 122.1±plus-or-minus\pm±1.4
Maze2D Medium 21.7±plus-or-minus\pm±13.5 112.2±plus-or-minus\pm±1.5 123.3±plus-or-minus\pm±1.6 123.8±plus-or-minus\pm±3.5
Maze2D Large 30.3±plus-or-minus\pm±6.9 111.0±plus-or-minus\pm±13.5 112.6±plus-or-minus\pm±16.3 104.3±plus-or-minus\pm±3.4
Single-Task Average 34.2 109.9 116.1 116.7
Multi2D U-Maze 69.8±plus-or-minus\pm±16.0 129.8±plus-or-minus\pm±3.0 127.3±plus-or-minus\pm±3.3 129.5±plus-or-minus\pm±0.9
Multi2D Medium 58.4±plus-or-minus\pm±5.3 116.5±plus-or-minus\pm±2.8 124.2±plus-or-minus\pm±1.2 126.5±plus-or-minus\pm±4.1
Multi2D Large 35.8±plus-or-minus\pm±4.2 121.7±plus-or-minus\pm±5.0 138.7±plus-or-minus\pm±5.9 127.3±plus-or-minus\pm±8.6
Multi-Task Average 54.7 122.7 130.1 127.8
TABLE III: The performance of T-CFM and Diffuser on the long-horizon Maze2D compared when given a single sampling timestep N=1𝑁1N=1italic_N = 1 and maximum sampling timesteps, N=256𝑁256N=256italic_N = 256. Our model (T-CFM) is able to drastically reduce the number of sampling steps required to generate feasible plans whereas the Diffuser model fails at N=1𝑁1N=1italic_N = 1.
Refer to caption
Figure 6: Compared to Diffuser, our T-CFM model does not drop in performance as we reduce the number of sampling steps on the Large Maze2D domain.

VII Limitations and Future Work

We currently do not explicitly reason about potential interaction between multiple agents. Future work includes extending our framework to other domains such as social navigation and trajectory prediction in autonomous driving, which require explicit reasoning of other agents. Additionally, further work could be used to increase the expressiveness of our flow matching models. Recent work on stochastic bridge matching [23] could be used to combine the deterministic ODE formulation of flow matching with the stochastic SDE formulation in diffusion models. Finally, future experiments on using T-CFM in dynamic situations can better show our model’s capabilities for real robot tasks.

VIII Conclusion

T-CFM is a novel approach for efficient trajectory forecasting and planning in robotics. By learning time-varying vector fields through flow matching, T-CFM achieves state-of-the-art performance on tasks like adversarial tracking, aircraft trajectory prediction, and long-horizon planning. T-CFM offers significant speed-ups compared to diffusion-based models without compromising accuracy, paving the way for more autonomous and responsive robots operating in complex, dynamic environments.

References

  • [1] Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2023.
  • [2] Yaakov Bar-Shalom, X Rong Li, and Thiagalingam Kirubarajan. Estimation with applications to tracking and navigation: theory algorithms and software. John Wiley & Sons, 2004.
  • [3] Yuhang Che, Allison M. Okamura, and Dorsa Sadigh. Efficient and trustworthy social navigation via explicit and implicit robot–human communication. IEEE Transactions on Robotics, 36(3):692–707, 2020.
  • [4] Rong Chen and Jun S Liu. Mixture kalman filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(3):493–508, 2000.
  • [5] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. Advances in neural information processing systems, 28, 2015.
  • [6] Petar M Djuric, Mahesh Vemula, and Mónica F Bugallo. Target tracking by particle filtering in binary sensor networks. IEEE Transactions on signal processing, 56(6):2229–2238, 2008.
  • [7] Matthew Dunbabin and Lino Marques. Robots for environmental monitoring: Significant advancements and applications. IEEE Robotics & Automation Magazine, 19(1):24–39, 2012.
  • [8] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning, 2020.
  • [9] B. Grocholsky, J. Keller, V. Kumar, and G. Pappas. Cooperative air and ground surveillance. IEEE Robotics & Automation Magazine, 13(3):16–25, 2006.
  • [10] Dongyue Guo, Edmond Q. Wu, Yuankai Wu, Jianwei Zhang, Rob Law, and Yi Lin. Flightbert: Binary encoding representation for flight trajectory prediction. IEEE Transactions on Intelligent Transportation Systems, 24(2):1828–1842, 2023.
  • [11] Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning, 2022.
  • [12] Sertac Karaman and Emilio Frazzoli. Sampling-based algorithms for optimal motion planning. The international journal of robotics research, 30(7):846–894, 2011.
  • [13] William F Leven and Aaron D Lanterman. Unscented kalman filters for multiple target tracking with symmetric measurement equations. IEEE Transactions on Automatic Control, 54(2):370–375, 2009.
  • [14] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.
  • [15] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • [16] G Mallikarjuna Rao and Ch Satyanarayana. Visual object target tracking using particle filter: a survey. International Journal of Image, Graphics and Signal Processing, 5(6):1250, 2013.
  • [17] X. Rong Li and V.P. Jilkov. Survey of maneuvering target tracking. part i. dynamic models. IEEE Transactions on Aerospace and Electronic Systems, 39(4):1333–1364, 2003.
  • [18] Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 683–700. Springer, 2020.
  • [19] Matthias Schäfer, Martin Strohmeier, Vincent Lenders, Ivan Martinovic, and Matthias Wilhelm. Bringing up opensky: A large-scale ads-b sensor network for research. In Proceedings of the 13th IEEE/ACM International Symposium on Information Processing in Sensor Networks (IPSN), pages 83–94, April 2014.
  • [20] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
  • [21] Pratap Tokekar, Volkan Isler, and Antonio Franchi. Multi-target visual tracking with aerial robots. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3067–3072.
  • [22] Alexander Tong, Kilian FATRAS, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, 2024.
  • [23] Alexander Tong, Nikolay Malkin, Kilian Fatras, Lazar Atanackovic, Yanlei Zhang, Guillaume Huguet, Guy Wolf, and Yoshua Bengio. Simulation-free schrödinger bridges via score and flow matching. arXiv preprint 2307.03672, 2023.
  • [24] Zixuan Wu, Sean Ye, Manisha Natarajan, Letian Chen, Rohan Paleja, and Matthew C Gombolay. Adversarial search and tracking with multiagent reinforcement learning in sparsely observable environment. In 2023 International Symposium on Multi-Robot and Multi-Agent Systems (MRS), pages 43–49. IEEE, 2023.
  • [25] Sean Ye, Manisha Natarajan, Zixuan Wu, and Matthew C. Gombolay. Diffusion models for multi-target adversarial tracking. In 2023 International Symposium on Multi-Robot and Multi-Agent Systems (MRS), pages 142–148, 2023.
  • [26] Sean Ye, Manisha Natarajan, Zixuan Wu, Rohan Paleja, Letian Chen, and Matthew C Gombolay. Learning models of adversarial agent behavior under partial observability. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3688–3695. IEEE, 2023.