
License: arXiv.org perpetual non-exclusive license
arXiv:2403.10672v1 [cs.RO] 15 Mar 2024

Riemannian Flow Matching Policy for Robot Motion Learning

Max Braun$^{1}$, Noémie Jaquier$^{1}$, Leonel Rozo$^{2}$, and Tamim Asfour$^{1}$
This work was supported by the Carl Zeiss Foundation under the project JuBot and the European Union’s Horizon Europe Framework Programme under grant agreement No 101070596 (euROBIN).
$^{1}$Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany. [email protected]
$^{2}$Bosch Center for Artificial Intelligence, Renningen, Germany. [email protected]
Abstract

We introduce Riemannian Flow Matching Policies (RFMP), a novel model for learning and synthesizing robot visuomotor policies. RFMP leverages the efficient training and inference capabilities of flow matching methods. By design, RFMP inherits the strengths of flow matching: the ability to encode high-dimensional multimodal distributions, commonly encountered in robotic tasks, and a very simple and fast inference process. We demonstrate the applicability of RFMP to both state-based and vision-conditioned robot motion policies. Notably, as the robot state resides on a Riemannian manifold, RFMP inherently incorporates geometric awareness, which is crucial for realistic robotic tasks. To evaluate RFMP, we conduct two proof-of-concept experiments, comparing its performance against Diffusion Policies. Although both approaches successfully learn the considered tasks, our results show that RFMP provides smoother action trajectories with significantly lower inference times.

I INTRODUCTION

The problem of learning, synthesizing, and adapting robot motions in unstructured environments has recently been disrupted by the rise of deep generative models. These models enable a robot to learn elaborate skills that may display high-dimensional multimodal action distributions. They can also be interfaced with deep multimodal perception networks, thus allowing a robot to learn sensorimotor policies. These models have been leveraged in both imitation and reinforcement settings [1, 2, 3, 4], where diffusion processes [5, 6] have recently shown promising results in a plethora of real robotic tasks. Nevertheless, this type of model is characterized by expensive inference, as it often requires solving a stochastic differential equation, which might hinder its use in some robotic settings [5], e.g., for reactive motion policies. Moreover, when learning diffusion models on Riemannian manifolds, the computation of the score function of the diffusion process is not as simple as in the Euclidean case [7], and the inference process incurs increasing computational complexity.

Figure 1: Learned RFMP flows from the base distribution to the LASA datasets $\mathsf{S}$ and $\mathsf{W}$ on both $\mathbb{R}^{2}$ (top) and $\mathcal{S}^{2}$ (bottom). The flow is conditioned on random observations $\bm{o}$ from the training dataset.

In contrast to diffusion models, flow matching (FM) [8] takes a different approach. Intuitively, FM defines a series of small transformations (flows) that smoothly move samples from a base distribution towards the target data points (i.e., the demonstration dataset). Each flow is represented by a simple function that takes a base distribution sample and pushes it slightly in a specific direction. By chaining these small flows together, FM gradually transforms the prior distribution into the target (demonstration) distribution. The beauty of FM lies in its simplicity, as these flow functions are much easier to train and evaluate than solving the complex stochastic differential equations of diffusion models. Motivated by the recent efficacy of FM methods [8] across various machine learning domains [9, 10, 11], we propose to learn sensorimotor robot skills via a Riemannian Flow Matching Policy (RFMP). RFMP capitalizes on the easy training and fast inference of FM methods to learn and synthesize robot movements represented by end-effector pose trajectories. Our main contributions are twofold: (1) we pioneer the application of FM methods to sensorimotor robot policy learning, and (2) we empirically validate their effectiveness on the established LASA benchmark dataset [12].

Related work: The literature on robot policy learning is vast, and therefore we focus on approaches that design policies based on flow-based generative models. Normalizing flows [13] are arguably the first models to be broadly adapted as robot policy representations. The most common approach was to employ them as diffeomorphisms for learning stable dynamical systems [14, 15, 16], with extensions to Lie groups [17], and Riemannian manifolds [18]. A shortcoming of normalizing flows is their slow training due to the integration of the associated ODE. More recently, diffusion models have dominated the robot learning scene due to their more stable training and their ability to learn complex data distributions more accurately [6]. They have been primarily employed to learn motion planners [1] and complex control policies [4, 3, 2]. In contrast to the aforementioned works, our work leverages flow matching [8] to model robot motion policies. This choice stems from the inherent advantages of FM: it avoids the complex training procedures of normalizing flows and the computationally expensive inference of diffusion models. Furthermore, our method also accounts for full-pose trajectories by leveraging the recently developed Riemannian extension of FM models presented in [19].

II BACKGROUND

In this section, we provide a short background on Riemannian geometry, an overview of the general flow matching framework and its extension to Riemannian manifolds.

II-A Riemannian manifolds

Imagine a flexible, curved surface such as a globe. A smooth manifold, mathematically denoted by $\mathcal{M}$, can be intuitively conceptualized as such a surface: any sufficiently small patch of it looks flat and therefore resembles the Euclidean space $\mathbb{R}^{d}$ [20, 21], but the manifold as a whole may be curved or twisted, preventing it from being entirely flat. The smoothness of the manifold allows us to define directions and rates of change at each point, leading to tangent vectors. The set of tangent vectors of all curves passing through $\bm{x}\in\mathcal{M}$ forms a $d$-dimensional vector space, known as the tangent space $\mathcal{T}_{\bm{x}}\mathcal{M}$ of $\mathcal{M}$ at $\bm{x}$.
The collection of all such tangent spaces is called the tangent bundle $\mathcal{T}\mathcal{M}=\bigcup_{\bm{x}\in\mathcal{M}}\left\{(\bm{x},\bm{u}) \,|\, \bm{u}\in\mathcal{T}_{\bm{x}}\mathcal{M}\right\}$. A Riemannian manifold $(\mathcal{M},g)$ is a smooth manifold endowed with a Riemannian metric $g$, that is, a family of inner products $g_{\bm{x}}:\mathcal{T}_{\bm{x}}\mathcal{M}\times\mathcal{T}_{\bm{x}}\mathcal{M}\rightarrow\mathbb{R}$ associated to each point $\bm{x}\in\mathcal{M}$ [21].

To operate with Riemannian manifolds, we can leverage their Euclidean tangent spaces and resort to mappings back and forth between $\mathcal{T}_{\bm{x}}\mathcal{M}$ and $\mathcal{M}$, using the exponential and logarithmic maps. Specifically, the exponential map $\text{Exp}_{\bm{x}}(\bm{u}):\mathcal{T}_{\bm{x}}\mathcal{M}\to\mathcal{M}$ maps a point $\bm{u}\in\mathcal{T}_{\bm{x}}\mathcal{M}$ to a point $\bm{y}$ on the manifold, so that it lies on the geodesic starting at $\bm{x}$ in the direction $\bm{u}$ and such that the geodesic distance $d_{\mathcal{M}}(\bm{x},\bm{y})=d_{\mathbb{R}}(\bm{x},\bm{u})$. The inverse operation is the logarithmic map $\text{Log}_{\bm{x}}(\bm{y}):\mathcal{M}\to\mathcal{T}_{\bm{x}}\mathcal{M}$.
Finally, the parallel transport $\Gamma_{\bm{x}\rightarrow\bm{y}}(\bm{u}):\mathcal{T}_{\bm{x}}\mathcal{M}\to\mathcal{T}_{\bm{y}}\mathcal{M}$ describes how tangent vectors can be transported along curves on $\mathcal{M}$ while maintaining their intrinsic geometric properties. This allows us to operate on elements lying in different tangent spaces.
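To make the exponential and logarithmic maps concrete, the following minimal sketch implements them for the unit sphere $\mathcal{S}^{2}$ embedded in $\mathbb{R}^{3}$, using the standard closed-form expressions for this manifold. The function names `sphere_exp` and `sphere_log` are illustrative choices and do not come from the paper; parallel transport is not covered.

```python
import numpy as np

def sphere_exp(x, u):
    """Exponential map on the unit sphere: follow the geodesic from x along u."""
    norm_u = np.linalg.norm(u)
    if norm_u < 1e-12:                       # zero tangent vector: stay at x
        return x
    return np.cos(norm_u) * x + np.sin(norm_u) * u / norm_u

def sphere_log(x, y):
    """Logarithmic map on the unit sphere: tangent vector at x pointing to y."""
    cos_theta = np.dot(x, y)
    v = y - cos_theta * x                    # component of y orthogonal to x
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:                       # y == x, or antipodal (Log not unique)
        return np.zeros_like(x)
    theta = np.arctan2(norm_v, cos_theta)    # geodesic distance between x and y
    return theta * v / norm_v

x = np.array([0.0, 0.0, 1.0])                # north pole
y = np.array([1.0, 0.0, 0.0])                # point on the equator
u = sphere_log(x, y)                         # tangent vector at x
y_back = sphere_exp(x, u)                    # mapping back recovers y
```

Here the norm of `u` equals the geodesic distance between `x` and `y` (a quarter of a great circle), and `u` is orthogonal to `x`, as expected for an element of the tangent space at `x`.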

II-B Flow Matching

Flow matching [8] is a simulation-free generative model that reshapes a simple base density $p_{0}\in\mathbb{P}(\mathbb{R}^{d})$ into a more complicated target distribution $p_{1}\in\mathbb{P}(\mathbb{R}^{d})$ via the push-forward of the prior, $p_{1}=\phi_{\sharp}p_{0}$, with $\phi$ denoting the flow and $\sharp$ the push-forward operator. To design this flow, we can define a vector field $u_{t}:[0,1]\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ that defines the ODE,

$$\frac{d\phi_{t}(\bm{x})}{dt}=u_{t}(\phi_{t}(\bm{x}))\quad\text{with initial condition}\quad\phi_{0}(\bm{x})=\bm{x}. \tag{1}$$

Loosely speaking, the vector field $u_{t}$ defines how a sample $\bm{x}_{0}\sim p_{0}$ is transformed over time (from $t=0$ to $t=1$) to match a target sample $\bm{x}_{1}\sim p_{1}$. At a density level, the vector field defines a probability density path $p_{t}:[0,1]\times\mathbb{R}^{d}\rightarrow\mathbb{R}_{>0}$, i.e., an interpolation in probability space, which is characterized by the continuity equation [8].
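As an illustration of how the ODE (1) turns a vector field into a flow, the sketch below integrates it with explicit Euler steps. The vector field $u_{t}(\bm{x})=-\bm{x}$ is a hypothetical toy example chosen because its exact flow, $\phi_{t}(\bm{x})=e^{-t}\bm{x}$, is known; the function name `integrate_flow` and the step count are likewise assumptions made for illustration.

```python
import numpy as np

def integrate_flow(u, x0, num_steps=1000):
    """Approximate phi_1(x0) with explicit Euler steps on d phi_t / dt = u_t(phi_t)."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * u(t, x)                 # one Euler step along the vector field
    return x

# Toy vector field u_t(x) = -x, whose exact flow is phi_t(x) = exp(-t) * x.
x0 = np.array([1.0, -2.0])
x1_euler = integrate_flow(lambda t, x: -x, x0)
x1_exact = np.exp(-1.0) * x0
```

In practice, a higher-order off-the-shelf solver would typically replace the Euler loop; the structure of the computation stays the same.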

Assuming that both the probability path $p_{t}(\bm{x})$ and the corresponding vector field $u_{t}(\bm{x})$ are known, one could regress a parametrized vector field $v_{t}(\cdot;\bm{\theta}):[0,1]\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ to the target vector field $u_{t}$, which leads to the FM loss,

$$\mathcal{L}_{\text{FM}}(\bm{\theta})=\mathbb{E}_{t,p_{t}(\bm{x})}\|v_{t}(\bm{x};\bm{\theta})-u_{t}(\bm{x})\|_{2}^{2}, \tag{2}$$

where $\bm{\theta}$ are the learnable parameters, $t\sim\mathcal{U}[0,1]$, and $\bm{x}\sim p_{t}(\bm{x})$. Unfortunately, the objective in (2) is intractable since we do not actually have prior knowledge about $p_{t}$ and $u_{t}$. Lipman et al. [8] showed that by defining a conditional probability path $p_{t}(\bm{x}|\bm{z})$ (and consequently a conditional vector field $u_{t}(\bm{x}|\bm{z})$), it is possible to obtain a tractable conditional flow matching (CFM) loss,

$$\mathcal{L}_{\text{CFM}}(\bm{\theta})=\mathbb{E}_{t,q(\bm{z}),p_{t}(\bm{x}|\bm{z})}\|v_{t}(\bm{x};\bm{\theta})-u_{t}(\bm{x}|\bm{z})\|_{2}^{2}, \tag{3}$$

which has identical gradients to the unconditional loss (2) with respect to $\bm{\theta}$.
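The reason the two losses share gradients can be sketched as follows. This is a condensed version of the standard argument, assuming the marginalization identities used by Lipman et al. [8] for $p_{t}(\bm{x})$ and $u_{t}(\bm{x})$; it is meant as a reading aid rather than a full proof (regularity conditions are omitted).

```latex
% Marginal path and marginal vector field (assumed, following [8]):
%   p_t(x) = \int p_t(x|z) q(z) dz,
%   u_t(x) = \int u_t(x|z) \frac{p_t(x|z) q(z)}{p_t(x)} dz.
\begin{align*}
\|v_t(\bm{x};\bm{\theta}) - u\|_2^2
  &= \|v_t(\bm{x};\bm{\theta})\|_2^2
     - 2\,\langle v_t(\bm{x};\bm{\theta}), u\rangle + \|u\|_2^2, \\
\mathbb{E}_{t,p_t(\bm{x})}\|v_t(\bm{x};\bm{\theta})\|_2^2
  &= \mathbb{E}_{t,q(\bm{z}),p_t(\bm{x}|\bm{z})}\|v_t(\bm{x};\bm{\theta})\|_2^2, \\
\mathbb{E}_{t,p_t(\bm{x})}\langle v_t(\bm{x};\bm{\theta}), u_t(\bm{x})\rangle
  &= \mathbb{E}_{t,q(\bm{z}),p_t(\bm{x}|\bm{z})}
     \langle v_t(\bm{x};\bm{\theta}), u_t(\bm{x}|\bm{z})\rangle, \\
\mathcal{L}_{\text{FM}}(\bm{\theta}) - \mathcal{L}_{\text{CFM}}(\bm{\theta})
  &= \mathbb{E}_{t,p_t(\bm{x})}\|u_t(\bm{x})\|_2^2
     - \mathbb{E}_{t,q(\bm{z}),p_t(\bm{x}|\bm{z})}\|u_t(\bm{x}|\bm{z})\|_2^2 .
\end{align*}
```

The last difference does not depend on $\bm{\theta}$, so both objectives have the same gradient.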

Tong et al. [22] showed that there exist several forms of CFM depending on how we design the prior $q(\bm{z})$, the conditional probability $p_{t}(\bm{x}|\bm{z})$, and the corresponding vector field $u_{t}(\bm{x}|\bm{z})$. For example, by considering the conditioning variable as $\bm{z}=\bm{x}_{1}\sim p_{1}$, and by defining $p_{t}(\bm{x}|\bm{z})$ and $u_{t}(\bm{x}|\bm{z})$ as follows,

$$p_{t}(\bm{x}|\bm{z})=\mathcal{N}\left(\bm{x}\,|\,t\bm{x}_{1},(t\sigma-t+1)^{2}\right), \tag{4}$$
$$u_{t}(\bm{x}|\bm{z})=\frac{\bm{x}_{1}-(1-\sigma)\bm{x}}{1-(1-\sigma)t}, \tag{5}$$

one recovers the Gaussian CFM of Lipman et al. [8], which defines a probability path from a zero-mean normal distribution to a Gaussian distribution centered at $\bm{x}_{1}$. This is also the approach taken in this paper. Finally, the inference process boils down to: (1) drawing a sample from $p_{0}$; and (2) using the vector field $v_{t}(\bm{x};\bm{\theta})$ to solve the ODE (1) with an off-the-shelf solver.
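The Gaussian conditional path (4) and its vector field (5) can be sketched in a few lines. The snippet below assumes a standard normal base sample and a hypothetical data point `x1`; the value of `sigma` and the function names are illustrative and not taken from the paper. It also checks numerically that (5), evaluated along the path, coincides with the time derivative of the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1e-2                                 # terminal standard deviation (assumed value)

def sample_conditional_path(x0, x1, t):
    """Sample x_t ~ p_t(x | x1) of Eq. (4) by transforming x0 ~ N(0, I)."""
    return (1.0 - (1.0 - sigma) * t) * x0 + t * x1

def conditional_vector_field(x, x1, t):
    """Target vector field u_t(x | x1) of Eq. (5)."""
    return (x1 - (1.0 - sigma) * x) / (1.0 - (1.0 - sigma) * t)

x0 = rng.standard_normal(2)                  # base sample
x1 = np.array([2.0, -1.0])                   # hypothetical data point
t = 0.3
x_t = sample_conditional_path(x0, x1, t)
u_t = conditional_vector_field(x_t, x1, t)

# Time derivative of x_t = (1 - (1 - sigma) t) x0 + t x1 with respect to t.
velocity = x1 - (1.0 - sigma) * x0
```

The pair `(x_t, u_t)` is exactly what the CFM loss (3) regresses the parametrized vector field onto, which is why no ODE simulation is needed during training.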

The Riemannian case: In several robotic settings, the target data distribution may lie on a Riemannian manifold $\mathcal{M}$, as the desired movements for a robot’s end-effector often include an orientation component. Therefore, part of the robot state representation lies on either the $\mathcal{S}^{3}$ hypersphere or the $\operatorname{SO}(3)$ group, depending on the specific parametrization used. To properly handle such cases, Chen and Lipman [19] recently extended CFM to Riemannian manifolds (RCFM). Formally, this Riemannian formulation considers that $\bm{x}\in\mathcal{M}$, and therefore the vector field $u_{t}(\bm{x})\in\mathcal{T}_{\bm{x}}\mathcal{M}$ (i.e., it evolves on the tangent bundle $\mathcal{TM}$) generates a probability density path $p_{t}\in\mathbb{P}(\mathcal{M})$. As stated previously, a Riemannian manifold $\mathcal{M}$ is endowed with a Riemannian metric $g_{\bm{x}}$, which implies that the CFM loss (3) is now computed with respect to such a metric, as follows,

$$\mathcal{L}_{\text{RCFM}}(\bm{\theta})=\mathbb{E}_{t,q(\bm{z}),p_{t}(\bm{x}|\bm{z})}\|v_{t}(\bm{x};\bm{\theta})-u_{t}(\bm{x}|\bm{z})\|_{g_{\bm{x}}}^{2}. \tag{6}$$

As in the Euclidean case, we need to design the vector field and choose the base distribution. Following [19, 11], the most straightforward strategy is to exploit geodesic paths to design the flow $\phi$, i.e., we use the shortest path to connect $\bm{x}_{0}$ and $\bm{x}_{1}$. Importantly, for many known geometries such as the $\mathcal{S}^{d}$ hypersphere, the $\operatorname{SO}(3)$ group, or the manifold of symmetric positive definite matrices $\mathcal{S}_{++}^{d}$, to mention a few, we have closed-form geodesics. Specifically, the geodesic flow connecting $\bm{x}_{0}$ and $\bm{x}_{1}$ is given by,

$$\bm{x}_{t}=\text{Exp}_{\bm{x}_{0}}\left(t\,\text{Log}_{\bm{x}_{0}}\left(\bm{x}_{1}\right)\right),\quad t\in[0,1]. \tag{7}$$

We can now design the vector field $u_{t}(\bm{x}|\bm{z})$ by leveraging the ODE associated with the conditional flow, $d\phi_{t}(\bm{x})/dt=\dot{\bm{x}}_{t}=u_{t}(\bm{x}|\bm{z})$. This means that computing the vector field $u_{t}(\bm{x}|\bm{z})$ amounts to computing the time derivative of (7). Finally, the choice of the base distribution $p_{0}$ generally depends on the problem at hand. One could directly define the base density as a uniform distribution over $\mathcal{M}$ as in [19, 11], but it is also possible to use Riemannian or wrapped Gaussian distributions.
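For the sphere $\mathcal{S}^{2}$, the geodesic flow (7) and its time derivative can be written in closed form. The sketch below is one possible instantiation, assuming the unit sphere embedded in $\mathbb{R}^{3}$; the helper names are illustrative. The velocity is computed as $\text{Log}_{\bm{x}_{t}}(\bm{x}_{1})/(1-t)$, which on the sphere equals the time derivative of (7) for $t<1$ because the geodesic has constant speed, and the snippet compares it against a finite-difference derivative.

```python
import numpy as np

def sphere_exp(x, u):
    """Exponential map on the unit sphere."""
    norm_u = np.linalg.norm(u)
    if norm_u < 1e-12:
        return x
    return np.cos(norm_u) * x + np.sin(norm_u) * u / norm_u

def sphere_log(x, y):
    """Logarithmic map on the unit sphere."""
    cos_theta = np.dot(x, y)
    v = y - cos_theta * x
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:                       # same point, or antipodal (not unique)
        return np.zeros_like(x)
    return np.arctan2(norm_v, cos_theta) * v / norm_v

def geodesic_point(x0, x1, t):
    """Point x_t of Eq. (7) on the geodesic from x0 to x1."""
    return sphere_exp(x0, t * sphere_log(x0, x1))

def geodesic_velocity(x0, x1, t):
    """Time derivative of Eq. (7) at time t < 1, a tangent vector at x_t."""
    x_t = geodesic_point(x0, x1, t)
    return sphere_log(x_t, x1) / (1.0 - t)

x0 = np.array([0.0, 0.0, 1.0])
x1 = np.array([1.0, 0.0, 0.0])
t, h = 0.4, 1e-5
x_t = geodesic_point(x0, x1, t)
velocity = geodesic_velocity(x0, x1, t)
# Central finite difference of the geodesic, used only as a sanity check.
finite_diff = (geodesic_point(x0, x1, t + h) - geodesic_point(x0, x1, t - h)) / (2 * h)
```

The resulting `velocity` is orthogonal to `x_t`, i.e., it lies in the tangent space at `x_t`, and its norm equals the geodesic distance between `x0` and `x1`.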

III The Riemannian Flow Matching Policy

Given a set of trajectories $\mathcal{D}=\{\bm{o}_{n},\bm{a}_{n}\}_{n=1}^{N}$, where $\bm{o}$ denotes the observation and $\bm{a}$ represents the corresponding action, our goal is to leverage the CFM framework to learn a Riemannian flow matching policy (RFMP) $\pi_{\bm{\theta}}(\bm{a}|\bm{o})$. This policy aims to generate action sequences that adhere to the target (expert) distribution $\pi_{e}$. Note that, in the general case, we assume that both $\bm{o},\bm{a}\in\mathcal{M}$. We hereinafter explain how we leverage CFM to model, train, and use such a policy.

Figure 2: (a) RFMP trajectories. (b) DP trajectories. Demonstrations and learned trajectories on the LASA dataset $\mathsf{S}$ in $\mathbb{R}^{2}$ (left), on the LASA datasets $\mathsf{S}$ and $\mathsf{W}$ projected on $\mathcal{S}^{2}$ (middle-left, middle-right), and on a multimodal dataset made of mirrored datasets of the letter $\mathsf{L}$ projected on $\mathcal{S}^{2}$ (right). Reproductions start at the same initial observations as the demonstrations, or from randomly-sampled observations in the neighborhood of the demonstration dataset. Trajectory starts are depicted by dots in the multimodal case.

III-A RFMP training

First, we adapt RCFM to visuomotor policies by simply conditioning the parametrized vector field on the observation vector $\bm{o}$, that is, $v_{t}(\bm{a}|\bm{o})$. Second, inspired by the diffusion policies framework [4], we employ a receding horizon to achieve temporal consistency and smoothness in the predicted actions. This means that our predicted action horizon vector is constructed as $\bm{a}=[\bm{a}_{\tau},\bm{a}_{\tau+1},\ldots,\bm{a}_{\tau+T_{a}}]$ for a $T_{a}$-step prediction horizon. This implies that all samples $\bm{a}_{1}$ drawn from the target distribution have the form of the action horizon vector $\bm{a}$.
Moreover, the samples $\bm{a}_{0}$ from the base distribution are constructed as $\bm{a}_{0}=[\bm{a}_{p_{0}},\ldots,\bm{a}_{p_{0}}]$ with $\bm{a}_{p_{0}}\sim p_{0}$. Nevertheless, instead of defining a similar receding horizon for the observations, we randomly sample only $T_{o}$ observation vectors from the training dataset to construct the conditioning variable $\bm{o}$. To do so, we follow the sampling strategy proposed in [9], which uses: (1) a reference observation $\bm{o}_{\tau-1}$; (2) a context observation $\bm{o}_{c}$ with the index $c$ uniformly sampled from $\{1,\ldots,\tau-2\}$; and (3) the distance $\tau-c$ between the prediction and the context observation. The combination of a reference and a context observation overcomes the fact that a single observation carries very little information, and provides additional information about the direction of the motion.
Therefore, the observation vector is defined as $\bm{o}=[\bm{o}_{\tau-1},\bm{o}_{c},\tau-c]$. The aforementioned strategy leads to the following RFMP loss,
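The construction of the action horizon and of the observation vector can be sketched as below for a single demonstration stored as arrays. This is an illustration under stated assumptions: the function name, the uniform sampling of $\tau$, and the toy data are hypothetical, and the text's 1-based indices are mapped to 0-based array indices.

```python
import numpy as np

def sample_training_pair(observations, actions, horizon, rng):
    """Build the observation vector o and action horizon a for one training example.

    Indices tau and c follow the 1-based convention of the text; arrays are 0-based.
    """
    num_steps = len(actions)
    # tau needs a reference o_{tau-1}, a context index c in {1, ..., tau-2},
    # and a complete action horizon a_tau, ..., a_{tau+horizon}.
    tau = rng.integers(3, num_steps - horizon + 1)    # tau in {3, ..., N - T_a}
    c = rng.integers(1, tau - 1)                      # c in {1, ..., tau-2}
    a = actions[tau - 1 : tau + horizon]              # a_tau, ..., a_{tau+T_a}
    o = np.concatenate([observations[tau - 2],        # reference observation o_{tau-1}
                        observations[c - 1],          # context observation o_c
                        [tau - c]])                   # distance tau - c
    return o, a, tau, c

# Toy demonstration with 20 time steps of 2-dimensional observations and actions.
observations = np.arange(40, dtype=float).reshape(20, 2)
actions = observations + 0.5
rng = np.random.default_rng(0)
o, a, tau, c = sample_training_pair(observations, actions, horizon=4, rng=rng)
```

With a horizon of $T_{a}=4$, the action horizon `a` contains five consecutive actions, and `o` stacks the two observations with the scalar distance $\tau-c$.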

$$\mathcal{L}_{\text{RFMP}}(\bm{\theta})=\mathbb{E}_{t,q(\bm{a}_{1}),p_{t}(\bm{a}|\bm{a}_{1})}\|v_{t}(\bm{a}|\bm{o};\bm{\theta})-u_{t}(\bm{a}|\bm{a}_{1})\|_{g_{\bm{a}}}^{2}. \tag{8}$$

Algorithm 1 summarizes the training procedure of RFMP.

Input: Initial parameters $\bm{\theta}$, base and target distributions $q(\bm{a}_1)$, $p(\bm{a}_1)$.
Output: Regressed vector field parameters $\bm{\theta}$.
while termination condition do
    Sample time step $t \sim \mathcal{U}$.
    Sample training example $\bm{a}_1 \sim p$, and noise $\bm{a}_0 \sim q$.
    Sample observation vector $\bm{o}$.
    Compute target vector field $\dot{\bm{a}}_t = u_t(\bm{a}|\bm{a}_1)$ based on the geodesic flow (7).
    Evaluate $\mathcal{L}_{\text{RFMP}}(\bm{\theta})$ (8).
    Update parameters $\bm{\theta}$.
end while
Algorithm 1: Riemannian Flow Matching Policy
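To make one training iteration concrete for a single action on $\mathcal{S}^2$, the following sketch computes the interpolated point and the target velocity. It assumes that the geodesic flow (7) takes the form commonly used in [19], namely $\bm{a}_t = \text{Exp}_{\bm{a}_0}(t\,\text{Log}_{\bm{a}_0}(\bm{a}_1))$ with target $u_t = \text{Log}_{\bm{a}_t}(\bm{a}_1)/(1-t)$; this form and all function names are illustrative assumptions, not the authors' implementation.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def exp_map(x, v):
    """Exponential map on the unit sphere at x applied to tangent vector v."""
    n = norm(v)
    if n < 1e-12:
        return list(x)
    return [math.cos(n) * xi + math.sin(n) * vi / n for xi, vi in zip(x, v)]

def log_map(x, y):
    """Logarithmic map on the unit sphere: tangent vector at x pointing to y."""
    c = max(-1.0, min(1.0, dot(x, y)))
    u = [yi - c * xi for xi, yi in zip(x, y)]  # part of y orthogonal to x
    n = norm(u)
    if n < 1e-12:  # x == y (or antipodal, where the log map is not unique)
        return [0.0] * len(x)
    return [math.acos(c) * ui / n for ui in u]

def geodesic_target(a0, a1, t):
    """Point a_t on the geodesic from a0 to a1 and the assumed target velocity."""
    a_t = exp_map(a0, [t * vi for vi in log_map(a0, a1)])
    u_t = [vi / (1.0 - t) for vi in log_map(a_t, a1)]
    return a_t, u_t

def squared_error(v, u):
    """||v - u||^2; on the embedded sphere the metric is the Euclidean one
    restricted to the tangent space."""
    return sum((vi - ui) ** 2 for vi, ui in zip(v, u))

a_t, u_t = geodesic_target([0.0, 0.0, 1.0], [1.0, 0.0, 0.0], t=0.3)
```

In this sketch the target velocity is tangent to the sphere at $\bm{a}_t$ and has constant norm equal to the geodesic distance between $\bm{a}_0$ and $\bm{a}_1$; a learned field would be regressed onto it with `squared_error`.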

III-B RFMP inference

Once our RFMP is trained, the inference process, which corresponds to querying our policy $\pi_{\bm{\theta}}(\bm{a}|\bm{o})$, is carried out as follows: (1) Draw a sample $\bm{a}_0 \sim q$; (2) Employ an off-the-shelf ODE solver to integrate the learned vector field $v_t(\bm{a}|\bm{o};\bm{\theta})$ along the time interval $[0,1]$; (3) Execute only the first $T_e$ actions $[\bm{a}_\tau, \bm{a}_{\tau+1}, \ldots, \bm{a}_{\tau+T_e}]$ with $T_e < T_a$, from the whole predicted action horizon $\bm{a}$.
Note that the ODE solver queries the learned vector field $v_t(\bm{a}|\bm{o};\bm{\theta})$ using the observation vector $\bm{o} = [\bm{o}_{\tau-1}, \bm{o}_c, \tau - c]$ with $c \sim \mathcal{U}\{1, \ldots, \tau-2\}$. In the Euclidean case, we use the DOPRI ODE solver [23] implemented in torchdyn [24]. In the Riemannian case, we employ a Riemannian ODE solver based on the Euler method, as in [19].
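As an illustration of the Riemannian case, a Euler-type solver on $\mathcal{S}^2$ can be sketched as repeated exponential-map steps $\bm{a} \leftarrow \text{Exp}_{\bm{a}}(h\, v_t(\bm{a}))$. The code below is a minimal sketch under that assumption; `toy_field` is a hypothetical stand-in for the learned field $v_t(\bm{a}|\bm{o};\bm{\theta})$, and the tangent projection inside the loop is an added safeguard, not something stated in the text.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def exp_map(x, v):
    """Exponential map on the unit sphere at x applied to tangent vector v."""
    n = norm(v)
    if n < 1e-12:
        return list(x)
    return [math.cos(n) * xi + math.sin(n) * vi / n for xi, vi in zip(x, v)]

def log_map(x, y):
    """Logarithmic map on the unit sphere: tangent vector at x pointing to y."""
    c = max(-1.0, min(1.0, dot(x, y)))
    u = [yi - c * xi for xi, yi in zip(x, y)]
    n = norm(u)
    if n < 1e-12:
        return [0.0] * len(x)
    return [math.acos(c) * ui / n for ui in u]

def riemannian_euler(vector_field, a0, n_steps=20):
    """Integrate da/dt = v_t(a) from t = 0 to t = 1 with exponential-map steps."""
    a = list(a0)
    h = 1.0 / n_steps
    for k in range(n_steps):
        v = vector_field(k * h, a)
        # Remove any component along a so that v is tangent to the sphere at a.
        radial = dot(v, a)
        v = [vi - radial * ai for vi, ai in zip(v, a)]
        a = exp_map(a, [h * vi for vi in v])
    return a

# Hypothetical stand-in for the learned field: the geodesic field that
# points toward a fixed target action.
target = [1.0, 0.0, 0.0]

def toy_field(t, a):
    return [vi / (1.0 - t) for vi in log_map(a, target)]

a_final = riemannian_euler(toy_field, [0.0, 0.0, 1.0], n_steps=20)
```

Because every update goes through the exponential map, the iterates stay on the sphere by construction, which is the property that a Euclidean solver does not provide.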

III-C RFMP implementation

Our RFMP implementation builds on the RFM framework from Chen and Lipman [19]. Specifically, we parameterized the vector field $v_t(\bm{a}|\bm{o};\bm{\theta})$ using a standard multilayer perceptron (MLP) with 64 hidden units and 5 layers for all experiments reported in the sequel. We used the Swish activation function [25] with a learnable parameter. The input to the MLP network is a vector formed as the concatenation of time and the observation vector. Under the aforementioned configuration, the resulting model has a total of 32K learnable parameters. We optimized the network parameters $\bm{\theta}$ using Adam with a learning rate of $10^{-4}$ and an exponential moving average on the weights [26] with a decay of 0.999. For all experiments, we split the data as 80% train, 10% validation, and 10% test. We trained the network for 200 epochs and selected the best model based on its performance on the validation set.
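Two of the ingredients above are easy to state in code. The sketch below shows the Swish activation $x \cdot \text{sigmoid}(\beta x)$ with a scalar $\beta$ (which would be the learnable parameter) and a standard exponential-moving-average update with decay 0.999. The function names and the flat-list weight representation are illustrative assumptions; the exact EMA variant used in [26] may differ.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(x, beta=1.0):
    """Swish activation x * sigmoid(beta * x); beta would be learned."""
    return x * sigmoid(beta * x)

def ema_update(ema_weights, weights, decay=0.999):
    """One exponential-moving-average step on a flat list of weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

With a decay of 0.999, each update moves the averaged weights only 0.1% of the way toward the current weights, so the averaged model changes slowly relative to the optimizer iterates.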

Concerning the base distribution, our RFMP uses a Gaussian distribution $p_0 = \mathcal{N}(\bm{0}, \sigma\bm{I})$ in the Euclidean case $\mathcal{M} = \mathbb{R}^2$, where we set $\sigma = 1$ for the experiments reported next. In the Riemannian setting, i.e., $\mathcal{M} = \mathcal{S}^2$, we define the base distribution as a wrapped Gaussian distribution [27, 28] centered at the manifold origin $\bm{e} = (0, \ldots, 0, 1)^{\mathsf{T}} \in \mathcal{S}^d$, i.e., $p_0 = \mathcal{N}_{\mathcal{S}^2}(\bm{e}, \sigma\bm{I})$ with $\sigma = 0.5$ for our experiments.
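A wrapped Gaussian sample on $\mathcal{S}^2$ can be obtained by drawing a Gaussian vector in the tangent plane at $\bm{e}$ and pushing it onto the sphere with the exponential map. The following sketch illustrates this; it assumes that $\sigma\bm{I}$ denotes the covariance (so the standard deviation is $\sqrt{\sigma}$), and the function name is hypothetical.

```python
import math
import random

def sample_wrapped_gaussian_s2(sigma=0.5, rng=random):
    """Draw one sample on S^2 from a wrapped Gaussian centered at e = (0, 0, 1)."""
    std = math.sqrt(sigma)  # assumption: sigma * I is the covariance
    # The tangent space at e is the plane z = 0.
    vx, vy = rng.gauss(0.0, std), rng.gauss(0.0, std)
    n = math.hypot(vx, vy)
    if n < 1e-12:
        return [0.0, 0.0, 1.0]
    # Exponential map at e: cos(|v|) e + sin(|v|) v / |v|.
    return [math.sin(n) * vx / n, math.sin(n) * vy / n, math.cos(n)]
```

Every sample has unit norm by construction, so the base distribution is supported on the manifold before the flow is applied.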

IV EXPERIMENTS

We evaluate RFMP on the LASA dataset [12] in the Euclidean space $\mathbb{R}^2$ and on the same dataset but projected on the sphere $\mathcal{S}^2$. Using these datasets, we consider: (1) Trajectory-based policies, where observations are defined as current and past states along the trajectories; and (2) Visuomotor policies, where observations correspond to vector features extracted from grayscale images.

IV-A Trajectory-based policies

To learn the vector field $v_t$, we use a dataset $\{\{\{\bm{a}_{m,\tau}, \bm{o}_{m,\tau}\}_{c=1}^{\tau-2}\}_{\tau=2}^{T_m}\}_{m=1}^{M}$ of $M = 7$ demonstrations containing $T_m = 200$ timesteps each, where $\bm{a}_{m,\tau} = [\bm{a}_{m,\tau}, \ldots, \bm{a}_{m,\tau+T_a}]$ and $\bm{o}_{m,\tau} = [\bm{o}_{m,\tau-1}, \bm{o}_{m,c}, \tau - c]$ are the action and observation vectors of the $\tau$-th step of the $m$-th demonstration.
All actions and observations are normalized and projected onto the manifold $\mathcal{M}$ of interest. In this experiment, both actions $\bm{a}_\tau \in \mathcal{M}$ and observations $\bm{o}_\tau \in \mathcal{M}$ are represented in the position space of the manifold $\mathcal{M}$. The variance of the base distribution $p_0$ is selected such that the distribution roughly spans half of the sphere containing the data. We use a prediction horizon $T_a = 8$ and an execution horizon $T_e = T_a/2$.
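The construction of one action/observation pair from a demonstration can be sketched as follows. The function name, the 1-based indexing helper, and the use of a single list of positions for both actions and observations (reasonable here since both live in the position space) are illustrative assumptions. Since the context index is drawn from $\{1, \ldots, \tau-2\}$, the sketch requires $\tau \geq 3$.

```python
import random

def make_training_pair(positions, tau, T_a=8, rng=random):
    """Build (action chunk, observation vector) for the 1-based step tau."""
    T = len(positions)
    if tau < 3 or tau + T_a > T:
        raise ValueError("need 3 <= tau and tau + T_a <= T")

    def at(i):
        return positions[i - 1]  # 1-based access, as in the text

    # Action chunk [a_tau, ..., a_{tau + T_a}] (T_a + 1 entries as written).
    actions = [at(i) for i in range(tau, tau + T_a + 1)]
    # Context index, uniform on {1, ..., tau - 2} (randint is inclusive).
    c = rng.randint(1, tau - 2)
    # Observation vector [o_{tau-1}, o_c, tau - c].
    observation = (at(tau - 1), at(c), tau - c)
    return actions, observation
```

For example, `make_training_pair(demo, tau=10)` returns the positions at steps 10 to 18 together with the reference observation at step 9, a context observation from steps 1 to 8, and their index distance.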

Figure 1 shows the learned RFMP flows conditioned on observations from the training dataset on the manifolds $\mathbb{R}^2$ and $\mathcal{S}^2$, for the letters $\mathsf{S}$ and $\mathsf{W}$. We observe that the distributions reconstructed by RFMP closely match the original demonstrations. Figure 1(a) displays the trajectories reconstructed by sequentially executing the actions inferred by RFMP. The obtained trajectories closely follow the demonstrations when initialized with the same initial observations. Notably, when tested on initial conditions that are randomly sampled in the neighborhood of the demonstration support, RFMP generates trajectories that closely follow the demonstration pattern. This kind of generalization is desired, for example, when the task requires reproducing trajectories that closely resemble the demonstration style.

Figure 3: Demonstrations and learned trajectories on the LASA datasets $\mathsf{S}$ and $\mathsf{W}$ with different prediction horizons $T_a = \{2, 4, 8\}$ (from left to right): (a) RFMP on $\mathbb{R}^2$; (b) RFMP on $\mathcal{S}^2$; (c) DP on $\mathbb{R}^2$; (d) DP on $\mathcal{S}^2$. Reproductions start at the same initial observations as the demonstrations, or from randomly sampled observations in their neighborhood.

We compare RFMP against diffusion policies (DP) [4]. To do so, we employ the CNN-based diffusion network with 256M parameters provided by the authors. As in [4], we used the iDDPM algorithm [29] with the same 100 denoising diffusion iterations for both training and inference. Moreover, we used the same prediction and execution horizons as for RFMP. Figure 1(b) shows the trajectories obtained by sequentially executing the actions inferred by DP. Similarly to RFMP, the trajectories closely match the demonstrations when initialized with the same initial observations (blue curves). Interestingly, unlike RFMP, the trajectories starting at randomly sampled initial conditions close to the demonstrations (orange curves) tend to rejoin the demonstration data support, resulting in less variance across reproductions. This behavior might be partly explained by a high memorization of the training data [30], although this requires further investigation.

Importantly, the trajectories obtained by DP tend to be more jerky than those obtained with RFMP. We hypothesize that such jerky trajectories are a result of the inherent stochasticity of diffusion models during inference. These observations are supported by quantitative measures. Table I shows the dynamic time warping distance (DTWD) as a measure of reproduction accuracy for trajectories initialized with the same initial observations as the demonstrations, and the jerkiness as a measure of trajectory smoothness [31]. We observe that DP produces trajectories that display a similar or lower DTWD than RFMP. This can be explained by the tendency of DP to generate trajectories within the demonstration support, while RFMP displays an increased variance across reproductions. As observed qualitatively, RFMP produces arguably smoother trajectories than DP, as indicated by the lower jerkiness values reported in Table I.
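For readers unfamiliar with the two metrics, the sketch below implements the classic dynamic-programming form of DTW and one possible jerk measure. Both are illustrative assumptions: the point-wise cost is Euclidean (a geodesic distance would be the natural choice on $\mathcal{S}^2$), and jerkiness is taken here as the mean squared norm of the third finite difference, which may differ from the exact definition and normalization used in [31].

```python
import math

def dtw_distance(p, q):
    """Classic dynamic time warping cost between two sequences of points."""
    n, m = len(p), len(q)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(p[i - 1], q[j - 1])
            # Extend the cheapest of the three admissible alignments.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def mean_squared_jerk(traj, dt=1.0):
    """Mean squared norm of the third finite difference of a trajectory."""
    total, count = 0.0, 0
    for k in range(len(traj) - 3):
        jerk = [(d - 3 * c + 3 * b - a) / dt ** 3
                for a, b, c, d in zip(traj[k], traj[k + 1], traj[k + 2], traj[k + 3])]
        total += sum(j * j for j in jerk)
        count += 1
    return total / count if count else 0.0
```

Under these definitions, a reproduction that retraces a demonstration at a different speed has zero DTW cost, and a trajectory with constant acceleration has zero jerk.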

Note that diffusion policies are not adapted to handle data on Riemannian manifolds, and thus do not provide any guarantees that the resulting trajectories lie on the manifold of interest. This can be observed, e.g., for the $\mathsf{S}$ dataset on $\mathcal{S}^2$, where some trajectories enter the sphere in the middle part of the $\mathsf{S}$ trajectories. This means that a post-processing step would be required to ensure that the trajectories lie on the manifold of interest. Although possible, such post-processing steps are known to produce highly inaccurate predictions as they disregard the intrinsic geometry of the data, as discussed in [32]. A more technically sound solution would involve adapting diffusion policies using Riemannian formulations of diffusion models [7, 33].

We also tested the capabilities of both RFMP and DP to learn multimodal policies on Riemannian manifolds. The rightmost plots in Fig. 2 show the resulting trajectories for initial conditions matching those of the demonstrations dataset (blue curves) and for initial points drawn from a region close to the demonstrations (orange curves). Although both RFMP and DP are able to learn the multimodal pattern, their generalization behavior is different when tested on initial conditions that are different from the training dataset. We can again observe that the DP trajectories tend to move back to the data support. Interestingly, in the multimodal case, RFMP outperforms DP in terms of both DTWD and smoothness (see Table I).

Next, we ablate the prediction horizon $T_a$ for both RFMP and DP. Figure 3 shows the trajectories obtained for both policy types with $T_a = \{2, 4, 8\}$ and $T_e = T_a/2$. The corresponding quantitative measures (DTWD and jerkiness) are given in Table II. Interestingly, the RFMP trajectories remain smooth despite the reduction of the prediction horizon, even for $T_a = 2$. In contrast, DP exhibits jerkier behaviors with shorter prediction horizons. This trend is especially pronounced for trajectories starting from initial observations that are different from the training dataset (see orange curves in Fig. 2(b)-left and Fig. 2(d)-left). We hypothesize that the reduction of the prediction horizon has a stronger impact on the inference process of DP than on RFMP due to the inherent stochasticity of diffusion models.
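The interplay between $T_a$ and $T_e$ can be illustrated with a receding-horizon rollout that predicts $T_a$ actions, executes the first $T_e = T_a/2$, and then queries the policy again. The sketch below is illustrative only: `toy_policy` is a hypothetical stand-in for the learned policy, and treating each executed action as the next position is an assumption that fits the position-space setting of this experiment.

```python
def rollout(predict_chunk, initial_state, total_steps=200, T_a=8):
    """Predict T_a actions, execute the first T_e = T_a // 2, then re-plan."""
    T_e = T_a // 2
    trajectory, n_predictions = [initial_state], 0
    while len(trajectory) - 1 < total_steps:
        chunk = predict_chunk(trajectory, T_a)  # hypothetical policy query
        n_predictions += 1
        for action in chunk[:T_e]:
            if len(trajectory) - 1 >= total_steps:
                break
            trajectory.append(action)  # assumption: actions are next positions
    return trajectory, n_predictions

def toy_policy(history, T_a):
    """Stand-in policy: keep moving one unit per step along a line."""
    last = history[-1]
    return [last + k for k in range(1, T_a + 1)]

counts = {T_a: rollout(toy_policy, 0, 200, T_a)[1] for T_a in (2, 4, 8)}
```

For a 200-step trajectory this yields 200, 100, and 50 policy queries for $T_a = 2, 4, 8$, respectively, so shorter horizons re-plan more often; the value for $T_a = 8$ is consistent with the 50 prediction steps mentioned in the caption of Table III.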

Finally, we compare the inference time of RFMP and DP in Table III. All computations were performed on a laptop with a 2.60 GHz $\times$ 12 CPU, an Nvidia Quadro T200 GPU, and 31 GB RAM. In the Euclidean case, we observe a reduction of $\sim 30\%$ ($\sim 350$ ms) for the inference time of RFMP compared to DP. This is due to the fact that RFMP employs ODE solvers, which are generally much more efficient than the SDE solvers required for diffusion models. This result is an inherent strength of flow matching compared to diffusion models, as thoroughly analyzed in the flow matching literature [8, 22]. The inference time of RFMP on the sphere $\mathcal{S}^2$ exceeds that of DP. However, this is due to the fact that RFMP intrinsically handles data on Riemannian manifolds and therefore uses ODE solvers that are specific to this type of space. Instead, DP disregards the geometry of the data and thus employs Euclidean SDE solvers. A fair comparison of the inference times of RFMP and DP on the sphere would involve the adaptation of diffusion policies to data on Riemannian manifolds and leveraging Riemannian SDE solvers for inference.
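As a quick check of the Euclidean trajectory-based figures, the reduction can be recomputed from the mean inference times reported in Table III (803 ms for RFMP and 1142 ms for DP). The helper name below is hypothetical.

```python
def reduction(baseline_ms, new_ms):
    """Absolute (ms) and relative (%) reduction of new_ms w.r.t. baseline_ms."""
    saved = baseline_ms - new_ms
    return saved, 100.0 * saved / baseline_ms

# Mean inference times for S on R^2, trajectory-based policies (Table III).
saved_ms, saved_pct = reduction(1142.0, 803.0)
```

This gives a saving of 339 ms, i.e., roughly 29.7%, in line with the approximate values quoted above.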

| Method | DTWD: $\mathsf{S}$, $\mathbb{R}^2$ | DTWD: $\mathsf{S}$, $\mathcal{S}^2$ | DTWD: $\mathsf{W}$, $\mathcal{S}^2$ | DTWD: multi-$\mathsf{L}$, $\mathcal{S}^2$ | Jerkiness: $\mathsf{S}$, $\mathbb{R}^2$ | Jerkiness: $\mathsf{S}$, $\mathcal{S}^2$ | Jerkiness: $\mathsf{W}$, $\mathcal{S}^2$ | Jerkiness: multi-$\mathsf{L}$, $\mathcal{S}^2$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RFMP | $1.87 \pm 0.94$ | $0.95 \pm 0.32$ | $1.64 \pm 0.84$ | $\bm{6.14 \pm 6.56}$ | $\bm{2120 \pm 273}$ | $\bm{4077 \pm 900}$ | $4198 \pm 560$ | $\bm{2161 \pm 640}$ |
| DP | $\bm{0.98 \pm 0.22}$ | $\bm{0.80 \pm 0.21}$ | $\bm{0.90 \pm 0.35}$ | $7.06 \pm 7.73$ | $8172 \pm 747$ | $7612 \pm 543$ | $\bm{2944 \pm 1399}$ | $2201 \pm 744$ |

TABLE I: Average dynamic time warping distance (DTWD) and jerkiness (a.k.a. smoothness) for trajectory-based RFMP and DP. DTWD is computed between the demonstrations and the reproductions initialized as the demonstrations, while the smoothness is averaged over all the reproductions displayed in Fig. 2.

| Method | DTWD, $\mathbb{R}^2$: $T_a = 2$ | $T_a = 4$ | $T_a = 8$ | Jerkiness, $\mathbb{R}^2$: $T_a = 2$ | $T_a = 4$ | $T_a = 8$ | DTWD, $\mathcal{S}^2$: $T_a = 2$ | $T_a = 4$ | $T_a = 8$ | Jerkiness, $\mathcal{S}^2$: $T_a = 2$ | $T_a = 4$ | $T_a = 8$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RFMP | $1.13 \pm 0.29$ | $2.31 \pm 1.25$ | $1.87 \pm 0.94$ | $\bm{3454 \pm 276}$ | $\bm{3729 \pm 475}$ | $\bm{2120 \pm 273}$ | $\bm{0.52 \pm 0.18}$ | $0.90 \pm 0.34$ | $0.95 \pm 0.32$ | $\bm{2845 \pm 335}$ | $6169 \pm 556$ | $\bm{4077 \pm 900}$ |
| DP | $\bm{0.70 \pm 0.07}$ | $\bm{0.76 \pm 0.25}$ | $\bm{0.98 \pm 0.22}$ | $11905 \pm 1962$ | $7289 \pm 939$ | $8172 \pm 747$ | $0.60 \pm 0.16$ | $\bm{0.70 \pm 0.33}$ | $\bm{0.80 \pm 0.21}$ | $6586 \pm 3140$ | $\bm{4239 \pm 1306}$ | $7612 \pm 543$ |

TABLE II: Average dynamic time warping distance (DTWD) and jerkiness (a.k.a. smoothness) for RFMP and DP with different prediction horizons $T_a = \{2, 4, 8\}$. DTWD is computed between the demonstrations and the reproductions initialized as the demonstrations, while the smoothness is averaged over all the reproductions displayed in Fig. 3.

| Method | Trajectory-based: $\mathsf{S}$, $\mathbb{R}^2$ | Trajectory-based: $\mathsf{S}$, $\mathcal{S}^2$ | Visuomotor: $\mathsf{S}$, $\mathbb{R}^2$ | Visuomotor: $\mathsf{S}$, $\mathcal{S}^2$ |
| --- | --- | --- | --- | --- |
| RFMP | $\bm{803 \pm 55}$ | $1539 \pm 23$ | $\bm{1355 \pm 110}$ | $\bm{2351 \pm 88}$ |
| DP | $1142 \pm 17$ | $\bm{1147 \pm 26}$ | $2462 \pm 141$ | $2662 \pm 541$ |

TABLE III: Inference times (in milliseconds) per prediction step for RFMP and DP. These are averaged across the 50 prediction steps with $T_a = 8$, for both the 14 reproductions displayed in Fig. 2 for trajectory-based policies, and the 7 reproductions displayed in Fig. 5 for visuomotor policies.

IV-B Towards visuomotor policies

Figure 4: Examples of visual observations at the end of a demonstration of the LASA dataset $\mathsf{S}$: (a) $\mathbb{R}^2$; (b) $\mathcal{S}^2$.

In this section, we study the case where the RFMP vector field is conditioned on visual observations, thus resembling a visuomotor diffusion policy. Similarly to Section IV-A, we use a dataset defined as $\{\{\{\bm{a}_{m,\tau}, \bm{o}_{m,\tau}\}_{c=1}^{\tau-2}\}_{\tau=2}^{T_m}\}_{m=1}^{M}$ of $M = 7$ demonstrations containing $T_m = 200$ timesteps each, where $\bm{a}_{m,\tau} = [\bm{a}_{m,\tau}, \ldots, \bm{a}_{m,\tau+T_a}]$ and $\bm{o}_{m,\tau} = [\bm{o}_{m,\tau-1}, \bm{o}_{m,c}, \tau - c]$ are the action and observation vectors of the $\tau$-th step of the $m$-th demonstration.
Again, all action and observation vectors are normalized and projected onto the manifold $\mathcal{M}$ of interest. However, in this case, the observations $\bm{o}_\tau \in \mathcal{M}$ are given by the latent encodings (a.k.a. feature vectors) of $48 \times 48$ raw grayscale images depicting the temporal progress of the task. Examples of such images are shown in Fig. 4. Specifically, our vision perception backbone, which maps raw grayscale images to observation vectors, is the same as that used in DP [4]. Namely, we used a standard ResNet-18 in which we replaced: (1) the global average pooling with a spatial softmax pooling, and (2) BatchNorm with GroupNorm. The former modification maintains spatial information [34], while the latter stabilizes the training [35].
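To illustrate why spatial softmax pooling retains spatial information, the sketch below pools a single 2D feature map into the softmax-weighted expected image location, whereas global average pooling would collapse it to one scalar. The coordinate normalization to $[-1, 1]$ and the function name are illustrative assumptions; in an actual encoder this operation would be applied to every channel.

```python
import math

def spatial_softmax(feature_map):
    """Softmax-weighted expected (x, y) location of one 2D feature map."""
    h, w = len(feature_map), len(feature_map[0])
    m = max(max(row) for row in feature_map)  # subtract max for stability
    weights = [[math.exp(v - m) for v in row] for row in feature_map]
    z = sum(sum(row) for row in weights)
    ex = ey = 0.0
    for i, row in enumerate(weights):
        for j, wgt in enumerate(row):
            # Pixel coordinates normalized to [-1, 1] (an assumed convention).
            x = -1.0 + 2.0 * j / (w - 1) if w > 1 else 0.0
            y = -1.0 + 2.0 * i / (h - 1) if h > 1 else 0.0
            ex += x * wgt / z
            ey += y * wgt / z
    return ex, ey
```

A uniform map is pooled to the image center, while a map with a single strong activation is pooled to approximately the location of that activation.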

Figure 5: Demonstrations and trajectories reproduced by the visuomotor RFMP and DP on the LASA datasets $\mathsf{S}$ and $\mathsf{J}$ in $\mathbb{R}^2$ (left) and on the LASA datasets $\mathsf{S}$ and $\mathsf{W}$ projected on $\mathcal{S}^2$ (right): (a) RFMP trajectories; (b) DP trajectories.

| Method | DTWD: $\mathsf{S}$, $\mathbb{R}^2$ | DTWD: $\mathsf{J}$, $\mathbb{R}^2$ | DTWD: $\mathsf{S}$, $\mathcal{S}^2$ | DTWD: $\mathsf{W}$, $\mathcal{S}^2$ | Jerkiness: $\mathsf{S}$, $\mathbb{R}^2$ | Jerkiness: $\mathsf{J}$, $\mathbb{R}^2$ | Jerkiness: $\mathsf{S}$, $\mathcal{S}^2$ | Jerkiness: $\mathsf{W}$, $\mathcal{S}^2$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RFMP | $\bm{1.22 \pm 0.44}$ | $\bm{1.82 \pm 0.93}$ | $0.76 \pm 0.27$ | $\bm{0.84 \pm 0.48}$ | $10543 \pm 612$ | $7655 \pm 537$ | $\bm{3590 \pm 353}$ | $\bm{4455 \pm 306}$ |
| DP | $1.29 \pm 0.49$ | $2.35 \pm 1.66$ | $\bm{0.67 \pm 0.24}$ | $0.93 \pm 0.48$ | $\bm{6198 \pm 755}$ | $\bm{5588 \pm 801}$ | $5903 \pm 170$ | $5042 \pm 136$ |

TABLE IV: Average dynamic time warping distance (DTWD) and jerkiness (a.k.a. smoothness) for visuomotor RFMP and DP. DTWD is computed between the demonstrations and the reproductions displayed in Fig. 5 and the smoothness is averaged over the same reproductions.

We trained the visual encoder end-to-end with our RFMP, for which we used the same base distributions $p_0$ and prediction horizon $T_a = 8$ as in Section IV-A. In this visuomotor RFMP, we empirically observed that shortening the observation horizon used to sample the context observation increases temporal consistency and improves the smoothness of the predicted actions. Therefore, in the following experiments, we sample $c \sim \mathcal{U}\{\tau - w, \ldots, \tau - 2\}$ with $w = 50$. Figure 4(a) shows the demonstrations and the reproduced trajectories of the learned visuomotor RFMP. For both demonstrations and reproductions, the initial observations correspond to blank images for policies trained in $\mathbb{R}^2$, and an empty grayscale sphere for policies in $\mathcal{S}^2$. Similarly to the trajectory-based policies, the visuomotor RFMP successfully reproduces trajectories that match the demonstration pattern in both the Euclidean and the Riemannian settings.
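The windowed sampling of the context index can be sketched as follows. The text does not say how the window is handled when $\tau - w < 1$; clamping the lower bound to the first index is an assumption made here, and the function name is hypothetical.

```python
import random

def sample_context_index(tau, w=50, rng=random):
    """Sample c uniformly from {tau - w, ..., tau - 2}, clamped to indices >= 1."""
    if tau < 3:
        raise ValueError("tau must be at least 3 so that the index set is non-empty")
    # Assumption: clamp the window at the start of the demonstration.
    low = max(1, tau - w)
    return rng.randint(low, tau - 2)  # randint is inclusive on both ends
```

With $w = 50$, the context observation is never more than 50 steps older than the predicted step, in contrast to the trajectory-based case where it may come from anywhere in the past.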

We compare visuomotor RFMP against visuomotor DP. As in [4], and similarly to the visuomotor RFMP, we train the vision perception backbone (modified ResNet-18) end-to-end with the CNN-based diffusion network described in Section IV-A. Figure 4(b) shows the trajectories obtained by sequentially executing the actions inferred by the visuomotor DP. Similarly to RFMP, the trajectories closely match the demonstrations. Interestingly, the visuomotor RFMP performs competitively when compared to the visuomotor DP in terms of the DTWD metric, as shown in Table IV, despite having a simpler architecture parametrizing the RFMP vector field. Moreover, as observed in Section IV-A for the trajectory-based case, visuomotor RFMP leads to smooth trajectories, especially for policies on $\mathcal{S}^2$, as indicated by the low jerkiness values in Table IV. Let us emphasize once more that RFMP ensures that the predicted actions lie on the manifold of interest, as opposed to DP, which does not provide such guarantees.

Finally, we compare the inference time of visuomotor RFMP and DP in Table III. We observe a reduction of $\sim 45\%$ ($\sim 900$ ms) in the inference time of RFMP compared to DP. Interestingly, this reduction is greater for visuomotor policies than in the trajectory-based case. Moreover, the inference time of RFMP on the sphere $\mathcal{S}^2$ is similar to that of DP, even though RFMP uses Riemannian-specific ODE solvers, which are computationally more expensive than Euclidean ODE solvers. The reported findings indicate comparable performance between RFMP and DP in terms of task completion. However, RFMP exhibits a clear advantage in generating smoother action predictions, and this advantage remains consistent regardless of the prediction horizon. Furthermore, RFMP boasts significantly faster inference times compared to DP. These attributes make RFMP a compelling choice for real-time applications in various robotic domains.
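To illustrate why a Riemannian ODE solver costs more per step yet keeps the integrated state on the manifold, the following sketch contrasts a geodesic (exponential-map) Euler scheme on $\mathcal{S}^2$ with a plain Euclidean Euler scheme. This is an illustrative fixed-step integrator under stated assumptions, not the solver used in the experiments; the toy vector field `rotation_field` and all function names are hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def project_to_tangent(x, u):
    """Remove the component of u along x, giving a tangent vector at x on the sphere."""
    dot = sum(xi * ui for xi, ui in zip(x, u))
    return tuple(ui - dot * xi for xi, ui in zip(x, u))


def exp_map_sphere(x, v):
    """Exponential map on the unit sphere: follow the geodesic from x along v."""
    nv = norm(v)
    if nv < 1e-12:
        return tuple(x)
    return tuple(math.cos(nv) * xi + math.sin(nv) * vi / nv for xi, vi in zip(x, v))


def riemannian_euler(field, x0, n_steps):
    """Integrate dx/dt = field(x, t) on [0, 1] with geodesic Euler steps."""
    x, dt = tuple(x0), 1.0 / n_steps
    for k in range(n_steps):
        v = project_to_tangent(x, field(x, k * dt))
        x = exp_map_sphere(x, tuple(dt * c for c in v))  # extra trig per step
    return x


def euclidean_euler(field, x0, n_steps):
    """Same ODE with straight-line Euler steps; the state drifts off the sphere."""
    x, dt = tuple(x0), 1.0 / n_steps
    for k in range(n_steps):
        v = field(x, k * dt)
        x = tuple(xi + dt * vi for xi, vi in zip(x, v))
    return x


def rotation_field(x, t):
    """Hypothetical toy field: rotation about the z-axis (tangent to the sphere)."""
    return (-x[1], x[0], 0.0)


if __name__ == "__main__":
    x0 = (1.0, 0.0, 0.0)
    print(norm(riemannian_euler(rotation_field, x0, 10)))  # stays at unit norm
    print(norm(euclidean_euler(rotation_field, x0, 10)))   # grows above 1
```

The geodesic update needs a projection and trigonometric evaluations at every step, which is one source of the additional cost mentioned above, while the Euclidean update leaves the sphere after the first step.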

V CONCLUSION

We introduced Riemannian Flow Matching Policies (RFMP), a novel learning framework that leverages the simplicity and fast inference of flow matching models to model visuomotor robot policies on Riemannian manifolds. We evaluated RFMP in both trajectory-based and vision-based settings using the LASA dataset. Our results demonstrated that RFMP successfully learns policies that reproduce the demonstration patterns, even for initial conditions outside the training data. Compared to Diffusion Policies (DP), RFMP generates smoother predicted trajectories with significantly lower inference times. Interestingly, RFMP exhibited less performance degradation with decreasing action prediction horizons. Notably, RFMP achieved this competitive performance using a simple MLP architecture for its vector field, in contrast to the more powerful CNN architecture employed by DP in score matching. Our proof-of-concept experiments showed the potential of RFMP for learning complex visuomotor policies. Future work will evaluate the performance of RFMP in real-world robotic applications. Moreover, we will explore more powerful representations for the RFMP vector field and more informative prior models.

References

  • [1] M. Janner, Y. Du, J. Tenenbaum, and S. Levine, “Planning with diffusion for flexible behavior synthesis,” in Intl. Conf. on Machine Learning (ICML), 2022, pp. 9902–9915.
  • [2] M. Reuss, M. Li, X. Jia, and R. Lioutikov, “Goal-conditioned imitation learning using score-based diffusion policies,” in Robotics: Science and Systems (R:SS), 2023.
  • [3] Z. Wang, J. J. Hunt, and M. Zhou, “Diffusion policies as an expressive policy class for offline reinforcement learning,” in Intl. Conf. on Learning Representations (ICLR), 2023.
  • [4] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” in Robotics: Science and Systems (R:SS), 2023.
  • [5] C. Luo, “Understanding diffusion models: A unified perspective,” arXiv preprint arXiv:2208.11970, 2022.
  • [6] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,” ACM Comput. Surv., vol. 56, no. 4, 2023.
  • [7] C.-W. Huang, M. Aghajohari, J. Bose, P. Panangaden, and A. Courville, “Riemannian diffusion models,” in Neural Information Processing Systems (NeurIPS), 2022.
  • [8] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in Intl. Conf. on Learning Representations (ICLR), 2023.
  • [9] A. Davtyan, S. Sameni, and P. Favaro, “Efficient video prediction via sparsely conditioned flow matching,” in Intl. Conf. on Computer Vision (ICCV), 2023, pp. 23206–23217.
  • [10] A. H. Liu, M. Le, A. Vyas, B. Shi, A. Tjandra, and W.-N. Hsu, “Generative pre-training for speech with flow matching,” in Intl. Conf. on Learning Representations (ICLR), 2024.
  • [11] J. Bose, T. Akhound-Sadegh, K. Fatras, G. Huguet, J. Rector-Brooks, C.-H. Liu, A. C. Nica, M. Korablyov, M. M. Bronstein, and A. Tong, “SE(3)-stochastic flow matching for protein backbone generation,” in Intl. Conf. on Learning Representations (ICLR), 2024.
  • [12] A. Lemme, Y. Meirovitch, M. Khansari-Zadeh, T. Flash, A. Billard, and J. J. Steil, “Open-source benchmarking for learned reaching motion generation in robotics,” Paladyn, Journal of Behavioral Robotics, vol. 6, no. 1, 2015.
  • [13] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan, “Normalizing flows for probabilistic modeling and inference,” Journal of Machine Learning Research, vol. 22, no. 57, pp. 1–64, 2021.
  • [14] M. A. Rana, A. Li, D. Fox, B. Boots, F. Ramos, and N. Ratliff, “Euclideanizing flows: Diffeomorphic reduction for learning stable dynamical systems,” in Conference on Learning for Dynamics and Control (L4DC), 2020, pp. 630–639.
  • [15] S. A. Khader, H. Yin, P. Falco, and D. Kragic, “Learning stable normalizing-flow control for robotic manipulation,” in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2021, pp. 1644–1650.
  • [16] J. Urain, M. Ginesi, D. Tateo, and J. Peters, “ImitationFlow: Learning deep stable stochastic dynamic systems by normalizing flows,” in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2020, pp. 5231–5237.
  • [17] J. Urain, D. Tateo, and J. Peters, “Learning stable vector fields on Lie groups,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 12569–12576, 2022.
  • [18] J. Zhang, H. B. Mohammadi, and L. Rozo, “Learning Riemannian stable dynamical systems via diffeomorphisms,” in Conference on Robot Learning (CoRL), ser. Proceedings of Machine Learning Research, vol. 205, 2023, pp. 1211–1221.
  • [19] R. T. Q. Chen and Y. Lipman, “Flow matching on general geometries,” in Intl. Conf. on Learning Representations (ICLR), 2024.
  • [20] M. do Carmo, Riemannian Geometry. Birkhäuser Basel, 1992.
  • [21] J. M. Lee, Introduction to Riemannian Manifolds. Springer, 2018.
  • [22] A. Tong, K. Fatras, N. Malkin, G. Huguet, Y. Zhang, J. Rector-Brooks, G. Wolf, and Y. Bengio, “Improving and generalizing flow-based generative models with minibatch optimal transport,” Transactions on Machine Learning Research (TMLR), 2024.
  • [23] J. Dormand and P. Prince, “A family of embedded Runge-Kutta formulae,” Journal of Computational and Applied Mathematics, vol. 6, no. 1, pp. 19–26, 1980.
  • [24] M. Poli, S. Massaroli, A. Yamashita, H. Asama, J. Park, and S. Ermon, “TorchDyn: Implicit models and neural numerical methods in PyTorch,” arXiv preprint arXiv:2009.09346, 2020.
  • [25] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” arXiv preprint arXiv:1710.05941, 2017.
  • [26] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approximation by averaging,” SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838–855, 1992.
  • [27] K. V. Mardia and P. E. Jupp, Distributions on Spheres. John Wiley and Sons, Ltd, 1999, ch. 9, pp. 159–192.
  • [28] F. Galaz-Garcia, M. Papamichalis, K. Turnbull, S. Lunagomez, and E. Airoldi, “Wrapped distributions on homogeneous Riemannian manifolds,” arXiv preprint arXiv:2204.09790, 2022.
  • [29] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in Intl. Conf. on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 139, 2021, pp. 8162–8171.
  • [30] T. Yoon, J. Y. Choi, S. Kwon, and E. K. Ryu, “Diffusion probabilistic models generalize when they fail to memorize,” in ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.
  • [31] S. Balasubramanian, A. Melendez-Calderon, A. Roby-Brami, and E. Burdet, “On the analysis of movement smoothness,” Journal of NeuroEngineering and Rehabilitation, vol. 12, no. 112, 2015.
  • [32] N. Jaquier, L. Rozo, and T. Asfour, “Unraveling the single tangent space fallacy: An analysis and clarification for applying Riemannian geometry in robot learning,” in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2024.
  • [33] A. Lou, M. Xu, A. Farris, and S. Ermon, “Scaling Riemannian diffusion models,” in Neural Information Processing Systems (NeurIPS), 2023.
  • [34] A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín, “What matters in learning from offline human demonstrations for robot manipulation,” in Conference on Robot Learning (CoRL), ser. Proceedings of Machine Learning Research, vol. 164, 2022, pp. 1678–1690.
  • [35] Y. Wu and K. He, “Group normalization,” in European Conference on Computer Vision (ECCV), 2018, pp. 3–19.