WANDR: Intention-guided Human Motion Generation

Markos Diomataris 1,2 Nikos Athanasiou1 Omid Taheri1 Xi Wang2
Otmar Hilliges2 Michael J. Black1
1Max Planck Institute for Intelligent Systems, Tübingen, Germany       2ETH Zürich, Switzerland
Abstract

Synthesizing natural human motions that enable a 3D human avatar to walk and reach for arbitrary goals in 3D space remains an unsolved problem with many applications. Existing methods (data-driven or using reinforcement learning) are limited in terms of generalization and motion naturalness. A primary obstacle is the scarcity of training data that combines locomotion with goal reaching. To address this, we introduce WANDR, a data-driven model that takes an avatar’s initial pose and a goal’s 3D position and generates natural human motions that place the end effector (wrist) on the goal location. To solve this, we introduce novel intention features that drive rich goal-oriented movement. Intention guides the agent to the goal, and interactively adapts the generation to novel situations without needing to define sub-goals or the entire motion path. Crucially, intention allows training on datasets that have goal-oriented motions as well as those that do not. WANDR is a conditional Variational Auto-Encoder (c-VAE), which we train using the AMASS and CIRCLE datasets. We evaluate our method extensively and demonstrate its ability to generate natural and long-term motions that reach 3D goals and generalize to unseen goal locations. Our models and code are available for research purposes at wandr.is.tue.mpg.de.


Figure 1: WANDR starts from an arbitrary body pose and generates precise and realistic human motions that reach a specified 3D goal (depicted as a red sphere). Employing a purely data-driven approach, WANDR is a conditional Variational Autoencoder guided by intention features (depicted arrows) that steer the human’s orientation (yellow), position (cyan) and wrist (pink) towards the goal. WANDR is able to reach a wide range of goals even if they deviate significantly from the training data.

1 Introduction

Goals drive our motions. Even the simplest goal can give rise to intricate motions. Consider reaching for a coffee cup – it can be as straightforward as an arm extension or can involve the coordinated action of our entire body. Actions like bending down, extending our arm, and walking must come together to achieve the goal. At a granular level, we continuously make subtle adjustments to maintain balance and stay on course towards our objective. The result is a fluid motion that seamlessly integrates numerous smaller movements, all converging toward a common and simple goal: placing our hand on the cup. Generating this hierarchy of motions, from the overarching goal to the moment-to-moment individual actions, remains a longstanding challenge in computer vision, graphics, and robotics.

Here we focus on a representative task, illustrated in Fig. 1: given a goal location in space and a starting pose, a humanoid agent must place an end effector (wrist joint) on the goal location while moving in a natural human-like way. To solve the task, the agent needs to be able to approach the goal, orient itself towards it, and reach out such that its wrist makes contact with the goal. Our primary emphasis is on ensuring autonomy for human agents. Consequently, we strive to minimize the guidance information provided, limiting it only to the human’s initial pose and the goal’s position. Diverging from prior data-driven approaches [4, 19], we choose to refrain from evaluating the model solely on limited labeled data. Instead, we devise an evaluation pipeline that requires agents to reach goals positioned in diverse locations around them. Considering the arbitrary selection of the goal during evaluation, and the minimal guidance information provided, tackling this task is challenging, demanding an approach with the capacity to generalize beyond the distribution of the training dataset.

Existing methods approach this problem either using reinforcement learning (RL) [17, 27, 11, 46] or by capturing task-specific datasets [35, 8, 4]. While RL provides a principled way to explore the solution space, it comes with considerable shortcomings. The “trial and error” of exploratory learning, in combination with the high dimensionality of human motion, results in policies requiring an enormous amount of training even to achieve simple tasks such as walking to a waypoint [46, 5]. In addition, since motion naturalness is better captured by data than by reward functions, RL approaches tend to produce motions that lack naturalness and expressiveness. Data-driven approaches, on the other hand, rely on plentiful training motions that are acquired through motion capture and carefully curated for the downstream tasks [10, 4]. Such approaches do not scale and do not generalize well to out-of-distribution tasks.

In prioritizing both motion realism and training efficiency, we adopt a data-driven approach. However, current data-driven methods lack the ability to learn both from smaller datasets that provide high-quality human reaching motions with goal labels, and from unlabeled larger scale datasets that contain necessary motion skills such as navigating to a goal position. This raises two key challenges. First, how do we model human motion in a way that generated motions can combine skills from different datasets? Second, what should the training objective be in the cases where goal labels are absent?

To address these challenges, we propose WANDR (Wrist-driven Autonomous Navigation for Data-based goal Reaching). We observe that by modeling human motion generation as an autoregressive stochastic process that produces motions frame by frame, WANDR is able to combine pieces of different dataset distributions when generating a motion sequence. Each generation step is conditioned on goal-related information that we call intention (visualized arrows in Fig. 1). We carefully design intention in a way that strikes a balance between being informative enough to guide the avatar to reach the goal, while also being abstract enough to promote generalization to unseen goals. This allows our generated motions to reach goals that were never encountered during the training phase in a completely zero-shot evaluation scenario. By generating the motion in an autoregressive way, we disentangle the spatial and temporal dimensions of motion. This is necessary as it allows our model to generate novel long-term sequences while being realistic in terms of local dynamic details.

In more detail, our method is based on a conditional Variational Auto-Encoder (c-VAE) that learns to model motion as a frame-by-frame generation process by auto-encoding the pose difference between two adjacent frames. The condition signal consists of the human’s current pose and dynamics along with the intention information. Intention is a function of both the current pose and the goal location and therefore actively guides the avatar during the motion generation in a closed loop manner. Through training, the c-VAE learns the distribution of potential subsequent poses conditioned on the current dynamic state of the human and its intention towards a specific goal. We train WANDR using two datasets: AMASS [20], which captures a wide range of motions including locomotion, and CIRCLE [4], which captures reaching motions.

Although AMASS is large, it lacks any explicit label of goals or intentions. To address this, inspired by the Hindsight Experience Replay paradigm in robotics [3], we define intention using a hallucinated goal derived from the ground-truth wrist position in a future frame. This approach allows us to establish a unified training objective spanning AMASS and CIRCLE. Consequently, our model learns to combine motions from both datasets, enabling it to effectively reach arbitrary goals during testing.

In summary, we present WANDR, a data-driven method that combines an autoregressive motion prior with a novel intention guiding mechanism and is able to generate avatars that realistically move in space and reach arbitrary goals. We experimentally evaluate our approach, including the benefit of combining multiple datasets as well as the generalization capabilities of our motion generator. Our results underscore the efficacy of the intention mechanism as an elegant way of guiding the motion generation process while also enabling the incorporation of pseudo goal labels for datasets lacking explicit goal annotations. The model and code are available for research purposes.

2 Related Work

Early research in motion generation focuses on tasks like motion prediction [9, 2, 1, 7, 21, 30] and unconstrained motion generation [40, 43, 41, 26, 32, 23, 31, 39, 6, 18]. More recently, significant effort has been devoted to improving controllability, with a focus on motion generation conditioned on different types of goals [42, 28], enabling interactions with scenes [12, 13, 22] and objects [34, 45, 36, 48]. Methods that attempt goal-driven motion generation can be broadly divided into reinforcement learning and data-driven approaches.

2.1 Reinforcement Learning for Motion Synthesis

Many existing works employ Reinforcement Learning (RL) for the generation of task-specific long motion sequences. Representative work includes MotionVAE [17] and AMP [27]. MotionVAE [17] employs a two-step process where it initially leverages an autoregressive conditional Variational Autoencoder (VAE) to construct a latent space that encapsulates possible human movements. Subsequently, it utilizes RL to sample from this action space to reach a designated target location while avoiding obstacles by monitoring the area ahead. Similar to MotionVAE, GAMMA [46] learns a policy to extract samples from a latent space and then employs a tree-based search algorithm to find viable motions that steer clear of obstacles by considering the environmental geometric constraints. DIMOS [47] further extends the GAMMA framework by introducing two specialized policy networks: one for locomotion and one for interaction. Together these networks generate goal-conditioned motion sequences that dynamically interact with objects and the environment. AMP [27] learns an adversarial motion prior from unstructured datasets and then applies goal-conditioned RL. This approach involves the formulation of a style reward to encourage the resemblance of the generated sequences to those in the dataset, complemented by a task-specific reward aimed at achieving a particular objective. Hassan et al. [11] extend AMP to produce motions that facilitate interactions with the scene, by conditioning both the discriminator and the policy network on the scene context. However, RL requires significant computation and struggles to generate natural and expressive motion sequences.

2.2 Data-driven Approaches for Motion Synthesis

Most data-driven approaches use existing motion capture (MoCap) datasets [20, 38] to train their models through supervised learning. The pioneering Neural State Machine method [33] is a data-driven technique for generating motion with character-scene interactions, focusing on scenarios with a limited number of objects and interactions. HuMoR [29] proposes a robust model for 3D human shape and temporal pose estimation, yet it falls short of generating motions that are conditioned on specific goals. The SAMP [10] method, designed for real-time stochastic motion synthesis, generates diverse human-scene interaction movements by breaking down the process into predicting goals, planning paths, and generating motion along a predefined route. The GOAL [35] method, trained on the GRAB dataset [34], produces motion sequences in which humans walk towards and grasp 3D objects. However, the generated motions exhibit minimal movements, especially in the feet. To address these constraints, the newly introduced CIRCLE [4] dataset provides a collection of reaching motion data. This dataset is used to train a neural network that generates diverse scene-aware reaching motions. Recently, diffusion models have seen the most success at generating motions conditioned on textual input [37, 44] and spatial data [14]. This advancement enables the synthesized motion to accurately reach specified target locations or navigate around obstacles. Nevertheless, the effectiveness of data-driven approaches is constrained by the amount of training data, and they lack generalization to out-of-distribution scenarios.

3 Method

Figure 2: WANDR architecture. During training, our model conditions on the intention vectors $I^{p}$, $I^{r}$ and $I^{w}$, learning to associate them with actions that result in reaching goals realistically. When the training data has no defined goal, we create a goal based on the wrist location in future frames; see Sec. 3.2. The state of the avatar, $p^{dyn}_{i}$, expresses the SMPL-X local pose parameters $p_{i}$, as well as the deltas $d_{i-1}$ of the body parameters in frame $i-1$. During inference, WANDR takes the intention features, the state, and random noise and returns the change in pose, $\hat{d}_{i}$. The next pose, $\hat{p}_{i}$, is obtained by integrating $\hat{d}_{i}$ with the previous pose $\hat{p}_{i-1}$.
Figure 3: In training, if goals are not specified, they are determined by the wrist location at a randomly selected future timestep. These goals compensate for the lack of paired ground-truth data in AMASS and direct human motion through intention vectors. During inference, target locations are used as goals, with intention vectors calculated based on these specific locations.

Our goal is to have a virtual human that can autonomously and realistically move from an initial pose to an arbitrary goal position and accurately place its right hand on the target. This challenge requires a nuanced understanding of human motion and the intricate dynamics involved in goal-oriented motions. For example, when the human tries to reach a distant goal, the motions are mostly focused on the legs and navigating the body to approach the object, but when it gets close to the object, the focus will be on moving the arms and upper body to reach the target. Using these observations, we develop a method named WANDR, which, although trained in a supervised setting on motion capture data, exhibits generalization in reaching unseen goal locations during test time. WANDR is designed to generate human motion in an autoregressive frame-by-frame fashion, conditioned on novel intention features. During training, these features are extracted by picking a future frame as the goal for the wrist. During inference, the intention features are dynamically computed based on the goal’s position in a feedback loop, guiding the virtual human to reach the goal. See Fig. 2 for the network overview.

In this section, we first consider the different distributions of the datasets we will be using (Section 3.1). Following this, we detail the components of the intention features and how they are computed during both the training and inference phases (Section 3.2). Finally, we define the motion representation and motion generator network (Section 3.3).

3.1 Two Complementing Datasets

For the development of WANDR, we use two key datasets: AMASS and CIRCLE. AMASS is a large-scale dataset that offers a broad range of general human motions but lacks a specific focus on goal-reaching tasks. Its diverse collection of movements provides a solid base for understanding human locomotion and body movement when the person is far away from the goal location. In contrast to AMASS, CIRCLE is tailored towards movements involving reaching specific target positions, particularly capturing the nuances of upper body and arm movements. By integrating AMASS’s general motion diversity with CIRCLE’s targeted goal-reaching data, we equip WANDR with the ability to generate the entire process of reaching a distant goal, from the initial navigation to the precise target-reaching actions.

3.2 Intention Features

We represent 3D human motion as a sequence of SMPL-X [25] body poses $\mathbf{p}=\{p_{1},\dots,p_{N}\}$. Each pose $p_{i}\in\mathbb{R}^{135}$ consists of three concatenated components: the body’s translation $t_{i}\in\mathbb{R}^{3}$, the root orientation $r_{i}\in\mathbb{R}^{6}$, and the body pose $\theta_{i}\in\mathbb{R}^{21\times 6}$, with both rotations in 6D format [49].
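As a quick check of the dimensions above, the following minimal sketch splits a flat 135-dimensional pose vector into its three components. The ordering inside the flat vector (translation, then root orientation, then body pose) and the helper name `split_pose` are assumptions made for illustration; the text only states that the three components are concatenated.

```python
import numpy as np

# Sizes taken from the text: translation (3), root orientation (6), body pose (21 x 6).
T_DIM, R_DIM, N_JOINTS, ROT_DIM = 3, 6, 21, 6
POSE_DIM = T_DIM + R_DIM + N_JOINTS * ROT_DIM  # 3 + 6 + 126 = 135


def split_pose(p):
    """Split a flat pose vector into (translation, root orientation, body pose).

    The concatenation order is an assumption of this sketch.
    """
    p = np.asarray(p, dtype=float)
    if p.shape != (POSE_DIM,):
        raise ValueError(f"expected shape ({POSE_DIM},), got {p.shape}")
    t = p[:T_DIM]
    r = p[T_DIM:T_DIM + R_DIM]
    theta = p[T_DIM + R_DIM:].reshape(N_JOINTS, ROT_DIM)
    return t, r, theta


# Toy input: the values 0..134 make the slicing easy to verify by eye.
t, r, theta = split_pose(np.arange(POSE_DIM))
```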

For the avatar to reach the goal, it is important to be informed about the spatial relation of the goal location with respect to its current pose, as well as to have a sense of time to reach the goal promptly. We achieve this by introducing the intention features, which are central to our approach.

To define the intention features $I_{i}$ at timestep $i$, it is crucial to first establish the selection criteria for the goal $G\in\mathbb{R}^{3}$ within a motion sequence. Our training involves both label-available scenarios (e.g., CIRCLE dataset) and label-absent scenarios (e.g., AMASS dataset). In label-available cases, where the goal position $G$ and the frame index $t_{G}$ at which it is reached are known, calculating $I_{i}$ is straightforward. Conversely, for label-absent scenarios like in AMASS, we pick a random future frame as $t_{G}$ and define the human’s right wrist joint location in that frame as the goal $G$ (Fig. 3).
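The label-absent case can be sketched as follows: given the right-wrist trajectory of one sequence, a future frame is drawn at random and the wrist position in that frame becomes the goal. The function name `sample_hindsight_goal` and the uniform sampling over all future frames are assumptions of this sketch; the text does not specify the sampling distribution.

```python
import numpy as np


def sample_hindsight_goal(wrist_positions, i, rng):
    """Pick a random future frame t_G > i and use the wrist position there as goal G.

    wrist_positions: (N, 3) array of right-wrist positions for one sequence.
    Returns (G, t_G).
    """
    wrist_positions = np.asarray(wrist_positions, dtype=float)
    n_frames = wrist_positions.shape[0]
    if not 0 <= i < n_frames - 1:
        raise ValueError("frame i must have at least one future frame")
    # Uniform over frames i+1 .. N-1 (assumption; the upper bound is exclusive).
    t_goal = int(rng.integers(i + 1, n_frames))
    return wrist_positions[t_goal].copy(), t_goal


rng = np.random.default_rng(0)
wrists = np.cumsum(np.full((50, 3), 0.01), axis=0)  # toy wrist trajectory
G, t_G = sample_hindsight_goal(wrists, i=10, rng=rng)
```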

In both scenarios, the intention features are defined as:

$$I_{i}=I_{i}(p_{i},G,t_{G},i).$$

These features are essentially a function of the current body pose, the goal position, and time, offering both spatial and temporal insights required for the motion to reach the goal. They are designed to provide sufficient information to reach goals while also enabling test-time generalization. We define them as three distinct components:

$$I_{i}=(I^{w}_{i},I^{r}_{i},I^{p}_{i}).$$

These components represent wrist intention, body orientation intention, and pelvis intention, respectively.

Wrist Intention: This is the main time-dependent component that guides the wrist to reach the goal. It is calculated as the necessary average velocity for the wrist to be at the goal location in time, defined as:

$$I^{w}_{i}=\frac{G-W_{i}}{t_{G}-i}$$

where $W_{i}$ is the wrist position at frame $i$ and $t_{G}$ is the frame when the goal should be reached. During training, we know the goal frame and the time to reach it. At test time, we know the goal location but need an externally-supplied time to reach it. An emergent behavior of this formulation is that, at inference, the model is able to adjust its movement speed and reach the goal just in time.
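The wrist intention is a one-line computation; a minimal sketch is given below. The function name `wrist_intention` and the choice to raise an error once the goal frame has passed are assumptions, since the text does not describe the behavior for $i\geq t_{G}$.

```python
import numpy as np


def wrist_intention(goal, wrist, t_goal, i):
    """Average per-frame velocity the wrist needs to be at the goal at frame t_goal."""
    frames_left = t_goal - i
    if frames_left <= 0:
        # Behavior after the goal frame is not specified in the text (assumption).
        raise ValueError("t_goal must lie after the current frame i")
    goal = np.asarray(goal, dtype=float)
    wrist = np.asarray(wrist, dtype=float)
    return (goal - wrist) / frames_left


# Goal 1 m ahead, 2 m to the side and 0.5 m up, to be reached in 20 frames.
I_w = wrist_intention(goal=[1.0, 2.0, 1.5], wrist=[0.0, 0.0, 1.0], t_goal=30, i=10)
```

Because the denominator shrinks as $i$ approaches $t_{G}$, a wrist that lags behind receives a larger intention vector in later frames, which is consistent with the speed adjustment described above.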

Orientation Intention: This component captures the body orientation when reaching the goal location. By conditioning on this, we ensure that the human model orients towards the goal and smoothly navigates towards it, preventing unnatural motions during inference, like walking backward. During training, this is defined as the difference between the forward direction of the current body frame, $H^{xy}_{i}$, and that of the goal body, $H^{xy}_{t_{G}}$, where $xy$ signifies removing the $z$ component. During inference, since we do not have the goal body, we use the pelvis position $P_{i}$ to calculate the pelvis-to-goal direction as the desired orientation. This feature is formulated as:

$$I^{r}_{i}=\begin{cases}H^{xy}_{t_{G}}-H^{xy}_{i}&\text{during training}\\(G-P_{i})^{xy}-H^{xy}_{i}&\text{during inference}.\end{cases}$$
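The two cases of the orientation intention can be sketched in one function. The signature, the keyword names, and the convention that $x$ and $y$ are the first two coordinates (with $z$ up) are assumptions of this sketch. The pelvis-to-goal vector is used exactly as written in the equation, without normalization.

```python
import numpy as np


def orientation_intention(heading_xy, *, goal_heading_xy=None, goal=None, pelvis=None):
    """Orientation intention I^r_i.

    Training:  pass goal_heading_xy, the forward direction of the body at frame t_G.
    Inference: pass goal and pelvis (3D points); the xy pelvis-to-goal vector is the target.
    """
    heading_xy = np.asarray(heading_xy, dtype=float)
    if goal_heading_xy is not None:
        target = np.asarray(goal_heading_xy, dtype=float)
    elif goal is not None and pelvis is not None:
        # Drop the z component (assumed to be the third coordinate).
        target = (np.asarray(goal, dtype=float) - np.asarray(pelvis, dtype=float))[:2]
    else:
        raise ValueError("provide goal_heading_xy, or both goal and pelvis")
    return target - heading_xy


# Training case: currently facing +x, facing +y at the goal frame.
I_r_train = orientation_intention([1.0, 0.0], goal_heading_xy=[0.0, 1.0])
# Inference case: facing +x, goal 3 m along +y from the pelvis.
I_r_infer = orientation_intention([1.0, 0.0], goal=[0.0, 3.0, 1.0], pelvis=[0.0, 0.0, 0.9])
```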

Pelvis Intention: This feature captures information about the position of the goal relative to the body. It is the difference between the goal and the pelvis joint, excluding the z (height) component. Following the approach in [35], we scale this distance by an exponential function that saturates this vector to have a maximum norm of $2$. This formulation helps the method generalize to navigating towards the goal during longer motions and helps the model learn since the distance from the goal does not grow indefinitely in extreme scenarios. This intention is defined by the equation:

$$I^{p}_{i}=2\times\left(1-e^{-\|G^{xy}-P^{xy}_{i}\|_{2}}\right)\times\frac{G^{xy}-P^{xy}_{i}}{\|G^{xy}-P^{xy}_{i}\|_{2}}.$$
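A minimal sketch of the pelvis intention follows. It assumes a negative exponent, so that the norm of the feature grows with distance and saturates at 2 as the text describes. The function name, the `eps` guard for a goal directly above the pelvis, and the coordinate convention ($z$ as the third coordinate) are also assumptions.

```python
import numpy as np


def pelvis_intention(goal, pelvis, max_norm=2.0, eps=1e-8):
    """Saturated xy offset from pelvis to goal; its norm never exceeds max_norm."""
    diff = (np.asarray(goal, dtype=float) - np.asarray(pelvis, dtype=float))[:2]
    dist = np.linalg.norm(diff)
    if dist < eps:
        # Goal directly above or below the pelvis: no horizontal direction (assumption).
        return np.zeros(2)
    # Negative exponent assumed: the scale rises from 0 towards 1 as the distance grows.
    scale = max_norm * (1.0 - np.exp(-dist))
    return scale * diff / dist


near = pelvis_intention(goal=[0.1, 0.0, 1.0], pelvis=[0.0, 0.0, 0.9])
far = pelvis_intention(goal=[0.0, 100.0, 1.0], pelvis=[0.0, 0.0, 0.9])
near_norm = np.linalg.norm(near)
far_norm = np.linalg.norm(far)
```

In this toy example the far goal yields a vector of norm close to 2 pointing along $+y$, while the goal 10 cm away yields a much shorter vector, so the feature stays bounded however distant the goal is.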

In the experimental section, we delve into the significance of each of these intention features and discuss the rationale behind our design choices, illustrating their impact on the effectiveness of our model.

3.3 Motion Network (WANDR)

WANDR is designed as a conditional Variational Auto-Encoder (c-VAE) network, operating in an autoregressive manner to generate sequential motion frames. This framework is pivotal in predicting the subsequent pose in a motion sequence, emphasizing an incremental, frame-by-frame approach.

Central to our approach is the training of the c-VAE to autoencode pose deltas, denoted as $d_{i}\in\mathbb{R}^{135}$. These deltas represent the difference between two consecutive poses, $p_{i}$ and $p_{i-1}$. By focusing on pose deltas rather than absolute pose values, our model benefits from an important inductive bias, enhancing its learning efficiency and performance, as supported by prior research [29, 17]. We separate rotational differences into body orientation ($d^{r}_{i}$) and body pose ($d^{\theta}_{i}$), each expressed in a 6D rotational format. Translation deltas are denoted as $d^{t}_{i}=t_{i}-t_{i-1}$.

To enhance the motion representation’s invariance, we remove information related to the global z-orientation. This is accomplished by subtracting the global z Euler angle of $r_{i-1}$ from both the translational ($d^{t}_{i}$) and rotational ($d^{r}_{i}$) deltas. The resulting deltas, $d^{t_{-z}}_{i}$ for translation and $d^{r_{-z}}_{i}$ for orientation, provide a more robust and consistent representation of motion, irrespective of global direction. Consequently, the delta pose features for any given frame $i$ are composed as follows:

$$d_{i}=(d^{t_{-z}}_{i},d^{r_{-z}}_{i},d^{\theta}_{i}).$$

An advantage of this representation is its consistency across different motion global orientations. For instance, in the scenario of a person walking, the delta representation remains agnostic to the walking direction. This attribute underscores the efficacy of our method in capturing the essence of motion without being biased towards any specific orientation or direction.
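The direction invariance can be illustrated for the translation delta. The sketch below interprets "subtracting the global z Euler angle" as rotating the world-frame delta by the negative yaw of the previous frame; this interpretation, the function names, and the convention that yaw is measured about the $z$ axis from $+x$ are assumptions, not details confirmed by the text.

```python
import numpy as np


def yaw_rotation(angle):
    """Rotation matrix about the z axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def canonical_translation_delta(t_prev, t_curr, yaw_prev):
    """Translation delta expressed in a frame with the previous global yaw removed."""
    delta = np.asarray(t_curr, dtype=float) - np.asarray(t_prev, dtype=float)
    return yaw_rotation(-yaw_prev) @ delta


step = 0.02  # metres moved per frame in the facing direction (toy value)

# Walking forward while facing +x (yaw = 0).
d_east = canonical_translation_delta([0.0, 0.0, 0.9], [step, 0.0, 0.9], yaw_prev=0.0)
# Walking forward while facing +y (yaw = 90 degrees).
d_north = canonical_translation_delta([0.0, 0.0, 0.9], [0.0, step, 0.9], yaw_prev=np.pi / 2)
```

Both calls describe the same forward step, so the two canonical deltas coincide even though the world-frame displacements differ.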

Condition Inputs: For each motion frame, the decoder is conditioned on a combination of state and intention features. Specifically, this condition signal is formulated as $c_{i}=(p^{dyn}_{i},I_{i})$. The state features, $p^{dyn}_{i}$, encapsulate the avatar’s current local pose, focusing on the z-component of translation and a modified orientation that excludes the global z Euler angle, along with the pose deltas $d_{i-1}$ of the previous step. This combination ensures that the generated motion at each step is informed by the local pose configuration of the avatar, its dynamics, and its directional intention towards the set goal, which is vital for producing realistic, goal-oriented human motions. An overview of the network architecture is shown in Fig. 2.
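The closed-loop structure of inference can be sketched with a toy example. The code below is not the trained c-VAE: `toy_decoder` is a hypothetical stand-in that simply follows the intention plus scaled latent noise, and the "body" is reduced to a single 3D wrist point. It only illustrates how the intention is recomputed from the current state at every frame and how the predicted delta is integrated into the next state.

```python
import numpy as np


def toy_decoder(state, intention, z, noise_scale=1e-3):
    """Hypothetical stand-in for the decoder: follow the intention, perturbed by noise z."""
    # `state` is accepted to mirror the condition c_i = (state, intention); unused here.
    return intention + noise_scale * z


def rollout(wrist_start, goal, t_goal, rng):
    """Generate a wrist trajectory frame by frame, recomputing the intention each step."""
    wrist = np.asarray(wrist_start, dtype=float).copy()
    goal = np.asarray(goal, dtype=float)
    prev_delta = np.zeros(3)
    trajectory = [wrist.copy()]
    for i in range(t_goal):
        intention = (goal - wrist) / (t_goal - i)      # wrist intention, closed loop
        state = np.concatenate([wrist, prev_delta])    # toy analogue of the state features
        z = rng.standard_normal(3)                     # latent noise sample
        delta = toy_decoder(state, intention, z)
        wrist = wrist + delta                          # integrate the predicted delta
        prev_delta = delta
        trajectory.append(wrist.copy())
    return np.stack(trajectory)


goal = np.array([2.0, 1.0, 1.2])
traj = rollout([0.0, 0.0, 1.0], goal, t_goal=60, rng=np.random.default_rng(0))
final_error = np.linalg.norm(traj[-1] - goal)
```

Because the intention is recomputed from the current wrist position, noise accumulated in earlier frames is corrected later, and the toy trajectory ends near the goal at frame `t_goal`.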

3.4 Training Losses

Our training objective is a composite of three distinct loss functions:

$$\mathcal{L}=\mathcal{L}_{rec}+\alpha\mathcal{L}_{KL}+\mathcal{L}_{J}.$$

The reconstruction loss, $\mathcal{L}_{rec}$, measures the accuracy of the motion reconstruction, quantified as the mean squared error (MSE) between the input pose delta, $d_{i}$, and its reconstructed counterpart, $\hat{d}_{i}$. It ensures the network’s ability to faithfully replicate the input motion.

The KL divergence loss, $\mathcal{L}_{KL}$, evaluates the deviation of the encoded distribution from a standard normal distribution. It is formulated as:

$$\mathcal{L}_{KL} = \mathcal{KL}(\mathcal{N}(0, I) \,||\, \mathcal{N}(\mu_i, \sigma_i)).$$

Here, $\mu_i$ and $\sigma_i$ represent the mean and variance of the Gaussian distribution predicted by the encoder. We balance this term with $\alpha = 10^{-2}$ to prevent the over-dominance of $\mathcal{L}_{KL}$, thereby aiding the decoder in avoiding collapse to mean predictions.

Finally, we use a joint error loss, $\mathcal{L}_{J}$, to ensure perceptual accuracy. We integrate the predicted $\hat{d}_i$ to get the predicted next pose $\hat{p}_i$, which is then fed into the SMPL-X model to obtain the predicted joint positions, $\hat{J}$. The loss $\mathcal{L}_{J}$ is the MSE between these predicted joints $\hat{J}$ and the ground truth joints $J$, addressing errors that might not be apparent in parameter space but are perceptually significant, such as incorrect body orientation.

Notably, our approach does not incorporate any explicit loss functions directly related to reaching a goal. This omission is a deliberate choice, aligning with our method’s emphasis on generalizing to diverse goal-reaching scenarios without being constrained by goal-specific training losses.
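As a rough illustration of how the three terms combine, the following sketch evaluates the objective for a single frame. It is not the released implementation: the function names are hypothetical, all quantities are flat lists of floats, and the joint positions are assumed to have been computed beforehand (the SMPL-X forward pass is not modeled here). The KL term follows the argument order of the equation as printed; many VAE implementations use the reverse order.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kl_diag_gaussians(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ) for diagonal Gaussians."""
    return 0.5 * sum(
        math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0
        for m0, v0, m1, v1 in zip(mu0, var0, mu1, var1)
    )

def training_loss(d, d_hat, mu, var, joints, joints_hat, alpha=1e-2):
    """L = L_rec + alpha * L_KL + L_J (hypothetical helper, single frame)."""
    l_rec = mse(d, d_hat)                       # pose-delta reconstruction
    zeros, ones = [0.0] * len(mu), [1.0] * len(mu)
    # Argument order as printed in the text: KL(N(0, I) || N(mu, sigma)).
    l_kl = kl_diag_gaussians(zeros, ones, mu, var)
    l_j = mse(joints, joints_hat)               # joints assumed precomputed
    return l_rec + alpha * l_kl + l_j
```

Note that no term depends on the goal location, which mirrors the absence of goal-specific losses described above.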

3.5 Motion Generation

In the inference phase of WANDR, our primary objective is to generate human motion that is driven towards a specific goal. Using the decoder of the WANDR c-VAE, we iteratively generate and integrate pose deltas. This process is initiated from the starting pose and progressively builds upon each subsequent pose.

The intention features are recalculated at each step based on the current predicted pose and the goal location. They serve as a guiding mechanism, ensuring that the generated motion is consistently oriented towards placing the human’s right wrist on the target.

The user can control the motion’s pace by specifying the time to reach the goal, $t_G$. This directly affects the wrist intention feature, enabling adjustments from fast to slow motions to suit various scenarios and constraints.
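The inference procedure described in this section can be sketched as a simple loop. This is a minimal sketch under stated assumptions: `decoder`, `intention_fn`, and `integrate` are hypothetical callables standing in for the trained c-VAE decoder (which would sample a latent vector internally), the intention features, and the pose integration step. The toy stand-ins below reduce the "pose" to a 3D wrist position and are only there to make the loop executable.

```python
def generate_motion(start_pose, goal, decoder, intention_fn, integrate,
                    t_goal, fps=30):
    """Autoregressively roll out poses for t_goal seconds (sketch only)."""
    poses = [start_pose]
    prev_delta = None
    n_frames = int(round(t_goal * fps))
    for frame in range(n_frames):
        time_left = (n_frames - frame) / fps                  # seconds until t_goal
        intention = intention_fn(poses[-1], goal, time_left)  # recomputed every step
        delta = decoder(poses[-1], prev_delta, intention)     # predicted pose delta
        poses.append(integrate(poses[-1], delta))             # integrate to next pose
        prev_delta = delta
    return poses

# Toy stand-ins (assumptions, not the learned model).
def toy_intention(pose, goal, time_left):
    return [(g - p) / time_left for p, g in zip(pose, goal)]

def toy_decoder(pose, prev_delta, intention, fps=30):
    return [v / fps for v in intention]

def toy_integrate(pose, delta):
    return [p + d for p, d in zip(pose, delta)]

motion = generate_motion([0.0, 0.0, 1.0], [2.0, 1.0, 1.5],
                         toy_decoder, toy_intention, toy_integrate, t_goal=8.0)
```

With these stand-ins the toy wrist arrives at the goal on the last frame; the learned decoder offers no such guarantee, which is why success is measured empirically in Sec. 4.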

4 Experiments

In this section, we outline the datasets used for training and evaluation and benchmark how each dataset affects the goal-reaching ability and the quality of the generated motions. Furthermore, we compare our approach with several baselines and ablate the effect of the different components of our intention vector.

4.1 Datasets & Processing

Our model is trained on two datasets: AMASS [20] and CIRCLE [4]. AMASS is a collection of 17k sequences, containing a wide range of motion types including long-term navigational skills like walking and turning. CIRCLE, on the other hand, contains 7.2k shorter sequences, each marked with a specific goal reached by a hand. For training, we refine AMASS by excluding sequences where feet are more than 20 cm above the ground, resulting in a combined dataset of nearly 20k sequences. This dataset is split into 80% training, 10% validation, and 10% test sets. All motions are re-sampled to 30 frames per second (fps).

4.2 Evaluation Strategy

Our evaluation procedure tests the degree to which WANDR can generalize to generating reaching motions that start from unseen poses and reach the whole range of 3D space around the starting pose. This is why we choose not to evaluate on held-out motion-goal pairs from the training data. Instead, we only hold out initial poses. During evaluation, starting from these unseen poses, we generate motions that attempt to reach goals that uniformly cover the volume of a cylinder centered on the human, including completely out-of-distribution goal locations (see Sup. Mat. Sec. E).

In particular, the set of evaluation goals is defined in a cylindrical coordinate frame by taking all the combinations of (1) 5 angles equally separating the 360 degrees around the human, (2) 5 different goal heights ranging from 0 to 1.8 meters and (3) 5 distances from 0.5 to 5 meters. We generate motions from 6 different initial poses, with an 8-second duration specified for reaching each goal. Five motions are sampled for each pose-goal combination, resulting in $5 \times 5 \times 5 \times 6 \times 5 = 3750$ unique motion sequences from which our metrics are computed. This setup allows us to thoroughly test our model in diverse scenarios, including long-term movements, navigational skills, and reaching motions at various heights and distances.
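The goal grid can be reproduced with a few lines of code. The sketch below assumes that heights and distances are evenly spaced between their stated end points, that z is the up axis, and that the cylinder is centered at the origin; these spacing and frame conventions are assumptions, and the function names are hypothetical.

```python
import math

def linspace(start, stop, n):
    """n evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]

def evaluation_goals(n_angles=5, n_heights=5, n_dists=5):
    """Goals on a cylindrical grid, returned as Cartesian (x, y, z) in meters."""
    angles = [2.0 * math.pi * k / n_angles for k in range(n_angles)]  # 0, 72, ..., 288 deg
    heights = linspace(0.0, 1.8, n_heights)
    dists = linspace(0.5, 5.0, n_dists)
    goals = []
    for theta in angles:
        for h in heights:
            for r in dists:
                goals.append((r * math.cos(theta), r * math.sin(theta), h))
    return goals

goals = evaluation_goals()
n_sequences = len(goals) * 6 * 5  # 6 initial poses, 5 samples per pose-goal pair
```

The grid contains 125 goals, so the product with 6 initial poses and 5 samples gives the 3750 sequences stated above.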

Figure 4: Diverse motion generated with WANDR: Displaying a range of motions generated by WANDR from various initial poses towards arbitrary goals. Examples include navigating towards goals from initial orientations not facing the goal (a, b, c, d), elevating the right hand to reach higher targets (c), and bending down to access goals near the floor (d), showcasing the model’s ability to adapt to novel goal locations.

4.3 Evaluation Metrics

To accurately assess the effectiveness of our approach in generating realistic, goal-oriented human motion, we employ a set of metrics focused on both the ability to successfully reach the intended goal and the naturalness of the motion. These metrics are:

  • Success Rate (SR): This quantifies the percentage of motions where the right wrist reaches within 10 cm of the goal, indicating successful goal attainment. The criterion for success aligns with that used in [4].

  • Foot Skating (FS): FS evaluates the naturalness of the motion based on foot skating, where a frame is considered as having foot skating if the lowest vertex of the human mesh moves more than 0.66 cm between consecutive frames (adjusted from the 1 cm threshold used in [4] to accommodate our 30 fps motion generation).

  • Distance to Goal (DTG): DTG records the closest distance in cm that the right wrist gets to the goal during a motion. This metric offers a nuanced view of the model’s capability to guide the motion towards the goal.
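The three metrics can be sketched as follows. This is a simplified illustration: positions are given in meters, the trajectory of the lowest mesh vertex is assumed to be precomputed per frame, displacement is measured as 3D Euclidean distance, and "within 10 cm" is read as less than or equal to the threshold. All of these conventions, as well as the function names, are assumptions.

```python
import math

def distance_to_goal_cm(wrist_traj, goal):
    """Closest wrist-to-goal distance over one motion, in cm."""
    return 100.0 * min(math.dist(w, goal) for w in wrist_traj)

def foot_skating_ratio(lowest_vertex_traj, threshold_cm=0.66):
    """Fraction of frame transitions where the lowest vertex moves too far."""
    steps = list(zip(lowest_vertex_traj[:-1], lowest_vertex_traj[1:]))
    skating = sum(1 for a, b in steps if 100.0 * math.dist(a, b) > threshold_cm)
    return skating / len(steps)

def success_rate(dtg_values_cm, threshold_cm=10.0):
    """Fraction of motions whose closest wrist-goal distance is within the threshold."""
    hits = sum(1 for d in dtg_values_cm if d <= threshold_cm)
    return hits / len(dtg_values_cm)
```

Computing SR from the per-motion DTG values makes the relation between the two metrics explicit: a motion counts as a success exactly when its DTG falls within the 10 cm threshold.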

4.4 Results

4.4.1 Quantitative Results

Combining AMASS and CIRCLE: Our evaluation in Table 1 validates our hypothesis about the benefits of training with both the AMASS and CIRCLE datasets. On the one hand, training only on CIRCLE (line 1) is not sufficient for the model to learn necessary navigational skills, such as walking, due to the dataset’s narrow focus on reaching motions. This is apparent from the very high foot-skating. On the other hand, training only on AMASS (line 2) results in high-quality motion generation with low foot-skating, but a relatively low success rate in goal-reaching. The Distance to Goal (DTG) metric suggests that the model is able to navigate close to the target, but it lacks the precise movement needed to successfully reach for the goal. Using both datasets (line 3) illustrates how our approach effectively merges the broad motion vocabulary of AMASS with the goal-oriented precision of CIRCLE, leading to both high-quality motion and improved goal-reaching capability.

We also compare with GOAL [35], a method that generates human motions that reach and grasp objects. GOAL is trained on GRAB [34], a dataset purely consisting of motions of humans grasping and manipulating objects. Since the relative positioning of the human and the object as well as the motions in GRAB have very small variations, GOAL does not succeed in any of the evaluation configurations (line 4).

Train Set        SR ↑    FS ↓    DTG (cm) ↓
WANDR/CIRCLE     0%      56%     205.4
WANDR/AMASS      16%     19%     48.0
WANDR            32%     16%     24.8
GOAL [35]        0%      29%     149.2
 
Table 1: We evaluate WANDR trained on different datasets and compare with GOAL [35]. Training solely on CIRCLE results in unrealistic motions, whereas training solely on AMASS excels in motion quality but struggles with finer goal-reaching skills. WANDR, leveraging what both datasets offer, demonstrates realistic motions as well as a better ability to reach goals compared to baselines and existing methods.

Ablation of Intention Features: In order to demonstrate the contribution of each component of the intention feature, we conduct an ablation study (Table 2). Using only wrist intention (line 1) results in the lowest foot skating, due to the minimal constraints applied to the motion, allowing for more adaptable motion planning. However, wrist intention features are time-dependent and do not carry information about the absolute distance to the goal. This can lead to the avatar over- or under-shooting and thus achieving a low success rate (SR). The addition of pelvis intention (line 2) enables the avatar to sense the distance to the goal, while the orientation intention (line 3) properly aligns the body to face the goal. Since pelvis and orientation intention add more constraints to the motion, they can sometimes cause more challenging body dynamics and produce motions with higher foot-skating (FS) (e.g. turning around in place instead of walking in a U-turn). We also try removing the intention from the motion prior and optimizing the latent space of a randomly generated initial motion to minimize the distance between the wrist and the goal (VAE + opt). This approach fails since the result is heavily dependent on the initialization of the latent variables of the motion. Our results confirm that each component of the intention feature is essential to achieving the overall performance of the model. For more details on the optimization see Sup. Mat.

Method                    SR ↑    FS ↓    DTG (cm) ↓
WANDR ($I^w$)             15%     13%     62.9
WANDR ($I^w + I^p$)       18%     17%     44.9
WANDR ($I^w + I^r$)       19%     19%     36.0
WANDR (full intention)    32%     16%     24.8
VAE + opt                 3%      4%      217.0
 
Table 2: Ablation Study. We evaluate the impact of each component of the intention vector. We also compare with an optimization baseline that does not use any condition signals. The results highlight the effectiveness of all of the components of intention as well as the fact that the complexity of the task makes “brute-forcing” with optimization unsuccessful.

Success Rate Distribution: Our decomposition of the model’s success rate, presented in Fig. 5, offers insights into how the model’s performance varies with respect to different goal positions. WANDR demonstrates a consistent ability to reach goals across various distances (blue) and directions (green). It is more capable of reaching goals that are closer to the natural position of the wrist and do not require extensive bending or stretching (yellow). This trend likely results from the abundance of standing or upright motion sequences in the training data, as opposed to motions involving bending or crouching. This analysis provides valuable information for future improvements and dataset balancing.

4.4.2 Qualitative Results

In Fig. 4, we show a variety of motion sequences generated with our network featuring reaching goals located at varying distances and heights, highlighting the model’s ability to realistically and smoothly orient, navigate, and reach for goals. These goals require actions such as bending down, turning, or stretching upwards. A critical aspect observed is the model’s ability to decelerate as it approaches the goal, seamlessly coordinating body and arm movements to achieve a natural-looking reaching motion. Overall, the qualitative results show that our network generalizes well to novel goal locations while generating realistic motions. For more results please see Sup. Mat. and the video.

Figure 5: We show the success rates of reaching goals at various heights, angles, and distances from the initial human pose. It highlights how goal position affects the model in accurately navigating and achieving the goals.

5 Conclusion

In conclusion, our research presents a novel data-driven approach to human motion generation, focusing on the task of reaching arbitrary goals in space. We introduce novel intention features that enable learning both general navigational skills from AMASS and goal-reaching skills from the CIRCLE dataset under the same distribution. We evaluate our model’s ability to reach unseen goals that cover the whole space an avatar should be able to reach around it. The autoregressive design of WANDR demonstrates generalizability in generating realistic human motions that reach unseen goals without requiring any extra guidance information such as a pre-defined trajectory.

Limitations and Future Work: Our approach is not without its limitations. Currently, error accumulation can sometimes bring the avatar to states from which it can no longer recover. Additionally, our model shows less proficiency in reaching extremely low or high goals, reflecting a need for more diverse training data encompassing a wider range of body movements. Future work could focus on incorporating realistic grasping mechanisms and interactions with objects, as well as including scene navigation capabilities. This could involve integrating more complex datasets or developing advanced algorithms capable of understanding and interacting with varied environmental contexts, thereby pushing the boundaries of realistic human motion simulation.
Acknowledgments: Markos Diomataris was supported in part by the Max Planck ETH Center for Learning Systems. We thank P. Ghosh, O. Ben Dov, and S. Tripathi for the fruitful discussions and A. Cseke, T. Niewiadomski, T. Alexiadis, and T. McConnell for conducting the user studies.
Conflicts of Interest:


Supplementary Material

Appendix A Introduction

This supplemental material offers more details regarding the use of our method in an optimization framework, the effect of motion duration on the generated motions, different applications of our method, and more qualitative results. Please see the Supplementary Video, where we extensively demonstrate the realism and adaptability of our generated reaching motions across diverse scenarios.

The video effectively contains: (1) the problem and our motivation, (2) our method and key ideas, (3) multiple example motions, and (4) different applications of our method such as extension to reaching dynamic goals. The video serves as a dynamic and illustrative supplement, showcasing our contributions in a manner that is hard to show in a paper format.

Appendix B Technical Implementation

B.1 Model Architecture

The WANDR c-VAE architecture [16] employs an Encoder and a Decoder, each composed of fifteen layers in a Multi-Layer Perceptron (MLP) configuration. We integrate ReLU activation functions, dropout, and layer normalization at each stage. The latent space is represented as a 64-dimensional vector. In our design, the condition signal of the c-VAE is concatenated with the input delta (in the case of the Encoder) and the latent vector (in the case of the Decoder).

B.2 Training Details

We developed and trained our method using the PyTorch framework [24]. We train it for 900 epochs on 4 Tesla V100 GPUs. We use a batch size of 512, resulting in approximately 20 hours of training duration. For optimization, we use Adam [15] with a starting learning rate of $10^{-4}$ that linearly decreases to $10^{-5}$ during training.
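The linear learning-rate decay can be written as a small interpolation function. The sketch below assumes the rate is updated once per epoch and reaches its final value at the last epoch; the granularity of the update is not specified in the text, and the function name is hypothetical.

```python
def learning_rate(epoch, total_epochs=900, lr_start=1e-4, lr_end=1e-5):
    """Linearly interpolate from lr_start (epoch 0) to lr_end (last epoch)."""
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)  # clamp to [0, 1]
    return lr_start + frac * (lr_end - lr_start)
```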

A crucial aspect of our training regimen is a teacher-forcing scheme, which involves feeding the model’s own predictions back into the input. This process helps the Decoder network learn to compensate for errors that may arise during the prediction of deltas. Throughout training, the c-VAE is trained on the task of auto-encoding the motion deltas. As the training progresses, we additionally perform motion generation for a few steps. Specifically, we reconstruct the deltas, integrate them to obtain the subsequent pose, and then sample from the latent space while conditioning on the generated pose. We repeat this process for up to $s$ steps, increasing $s$ linearly from 0 up to 10 over the span of 50 epochs and then keeping it fixed.
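The schedule for the number of self-fed generation steps can be sketched as below. The epoch at which the ramp begins is not stated in the text, so `start_epoch` is a hypothetical parameter that defaults to 0, and rounding to the nearest integer is likewise an assumption.

```python
def rollout_steps(epoch, start_epoch=0, ramp_epochs=50, max_steps=10):
    """Number of generation steps s performed at a given training epoch."""
    if epoch <= start_epoch:
        return 0
    frac = min((epoch - start_epoch) / ramp_epochs, 1.0)  # linear ramp, then constant
    return int(round(frac * max_steps))
```

For example, with the default values the model is rolled out for 5 steps at epoch 25 and for 10 steps from epoch 50 onward.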

B.3 Optimization Details

The formulation of our method as a c-VAE provides us with a smooth latent space that allows us to search this manifold in an optimization process to reach various target goals. For this, we apply specific constraints to the decoder’s output, i.e. the body poses, and optimize the latent space representation of the poses to achieve the desired motions.

In Tab. 2 of the main manuscript, we explore how optimizing a trained motion prior performs compared to our method without any optimization, in achieving various goals. The results show that WANDR without optimization performs better than the other methods, even with optimization.

Figure S.1: Generated motions for different goals with the same time constraint (a) and for a similar goal with varying time constraints (b). The results show that our dynamic intent features enable the adaptability of WANDR to generate time-controlled motions.

To do the optimization, we first generate a motion to reach a goal using WANDR, and then refine the motion through optimization aiming to align the wrist’s position in the final frame more closely with the target goal.

In order to achieve this, we employ a dual-component loss function:

$$\mathcal{L}_{opt} = \mathcal{L}_{norm} + \mathcal{L}_{goal}.$$

Here, $\mathcal{L}_{norm}$ represents the log-likelihood of the motion’s latent vectors under a normal distribution. $\mathcal{L}_{goal}$ calculates the mean square error between the wrist’s final frame location and the goal. $\mathcal{L}_{norm}$ seeks to maintain the generated motion within plausible human movements, while $\mathcal{L}_{goal}$ specifically tunes the motion to bring the wrist in proximity to the goal in the final frame. Notably, even though $\mathcal{L}_{goal}$ is applied only on the final frame, its gradient flows across the whole sequence since the autoregressive generation process is fully differentiable.
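A minimal sketch of how the two terms could be evaluated is given below. It assumes that the normality term is used as a negative log-likelihood under a standard normal (so that minimizing it keeps the latents probable) and that the final wrist position has already been obtained by decoding and integrating the latents; both points, and the function names, are assumptions. Gradient computation is omitted.

```python
import math

def norm_loss(latents):
    """Negative log-likelihood of all latent vectors under N(0, I)."""
    flat = [v for z in latents for v in z]
    return sum(0.5 * v * v + 0.5 * math.log(2.0 * math.pi) for v in flat)

def goal_loss(final_wrist, goal):
    """MSE between the final-frame wrist position and the goal."""
    return sum((w - g) ** 2 for w, g in zip(final_wrist, goal)) / len(goal)

def opt_loss(latents, final_wrist, goal):
    """L_opt = L_norm + L_goal for one motion (hypothetical helper)."""
    return norm_loss(latents) + goal_loss(final_wrist, goal)
```

In an actual optimization, the latents would be the free variables and the wrist position a differentiable function of them, so an automatic differentiation framework would be needed in place of these plain-Python helpers.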

Appendix C shows our method’s ability to produce motions with different time durations, as well as its integration with the optimization framework presented in the main paper.

Appendix C WANDR Applications

In this section we show that our method can be used in various scenarios and for different applications.

C.1 Time-controlled Motions

One key aspect of our intention features is the dependency of the wrist-intention vector on the goal-reaching time, enabling the generation of time-controlled motions. As mentioned in the main paper, this vector is computed by dividing the distance from the wrist to the goal by the time remaining to reach it. During inference, by changing the reaching time or distance, the generated motions adapt and become rapid or slow. Figure S.1 presents two scenarios: (a) reaching different goals within the same time duration, and (b) reaching a goal within different time durations. These cases illustrate how the motion varies in response to the time and distance parameters. For qualitative examples, please see the Supplementary Video.
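The effect of the reaching time on the wrist intention can be seen in a small numerical example. The sketch below follows the description above (wrist-to-goal vector divided by the remaining time); the clamping of the remaining time to one frame is an added assumption to avoid division by zero, and the function name is hypothetical.

```python
def wrist_intention(wrist, goal, time_left, min_time=1.0 / 30.0):
    """Wrist-to-goal vector divided by the time remaining (units: m/s)."""
    t = max(time_left, min_time)  # clamp (assumption) so the deadline itself is safe
    return [(g - w) / t for w, g in zip(wrist, goal)]

# Same 2 m reach, two different time budgets.
slow = wrist_intention([0.0, 0.0, 1.0], [2.0, 0.0, 1.0], time_left=8.0)
fast = wrist_intention([0.0, 0.0, 1.0], [2.0, 0.0, 1.0], time_left=2.0)
```

Here `slow` has magnitude 0.25 m/s and `fast` has magnitude 1.0 m/s: a quarter of the time budget yields a four times larger intention, which is what drives the faster motion.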

C.2 Optimization-enabled extensions

Figure S.2: A demonstration of WANDR generating a motion sequence to achieve multiple goals. The intention features are recalculated and updated after each iteration of the autoregressive process, enabling dynamic goal adjustments during the motion generation.

Multi-goal Reaching: We show that our unique intention features enable the generation of motions to achieve multiple goals sequentially. Although trained for single-goal achievement, the dynamic nature of our intention features allows for multiple goal definitions during inference. These features are recalculated and updated at each iteration of WANDR’s autoregressive process, adapting to changes in goal locations. Figure S.2 illustrates this capability, where a motion sequence is generated to achieve several goals. This adaptability also extends to tracking and following moving targets, as shown in our Supplementary Video.

Waypoint Following: Our method extends to reaching goals while having the virtual human pass through arbitrary waypoints. Specifically, we are able to choose a waypoint and have the human pass through it at a chosen frame, while still reaching for the goal at the end of the motion. To achieve this, we extend the optimization approach described in Sec. B.3 with an extra mean square error loss between the ground projection of the pelvis and the waypoint location for a particular frame. Fig. S.3 showcases an example of this application, underlining the adaptability of WANDR in navigating through waypoints while simultaneously reaching a goal (please see the video for an example motion). It is worth noting that passing through waypoints can be trivially extended to following trajectories, since a trajectory can be approximated by sampled waypoints on a curve.
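The extra waypoint term can be sketched as follows. The sketch assumes that z is the up axis, so the ground projection simply drops the z-coordinate, and that the waypoint is given as a 2D point on the ground plane; the function name is hypothetical.

```python
def waypoint_loss(pelvis_traj, waypoint_xy, frame):
    """MSE between the pelvis ground projection at `frame` and a 2D waypoint."""
    px, py, _ = pelvis_traj[frame]  # drop z, assuming z is the up axis
    wx, wy = waypoint_xy
    return ((px - wx) ** 2 + (py - wy) ** 2) / 2.0
```

Following a trajectory would amount to summing this term over several (waypoint, frame) pairs sampled along the curve.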

Figure S.3: A generated motion from WANDR for waypoint (blue sphere) following while reaching for a goal (pink sphere). The initial generation of WANDR follows the motion marked with the black arrows. After optimization, the motion manages to pass through the waypoint, while still reaching for the goal at the end of the motion. This illustrates that WANDR provides a smooth latent space for the motions that can be aligned with the goal-reaching motion with predefined waypoints using an optimization process.

C.3 Extending WANDR to other joints

WANDR can be straightforwardly extended to other joints by replacing the wrist position with the joint of interest in the definition of $I^w$. For example, by replacing the wrist with the pelvis joint, we can get a motion generator that can produce motions that follow waypoints. However, we note that controlling multiple body joints simultaneously is not trivial with the current design. It would require redesigning the intention features to enable learning which joint corresponds to which intention feature.

Appendix D Perceptual Study

Throughout our experiments, we empirically found that the foot skating metric has a high correlation with the quality of the motion. Nevertheless, in order to properly evaluate the perceptual quality of WANDR’s generated motions, we conduct two perceptual studies through Amazon Mechanical Turk. The studies aim at quantifying how close WANDR’s motions are perceptually to real human motions taken from AMASS. In the first study, users rate the realism of the motions using a 5-level Likert scale (1 $\rightarrow$ non-realistic & 5 $\rightarrow$ realistic). Only one motion is shown at a time. In the second study, users are asked to choose the most realistic motion between two, one coming from WANDR and one coming from an AMASS sequence. We clip motions to a 2-second duration and only show motions from WANDR that succeeded in reaching their goal.

In the first study, AMASS ground-truth motions score $3.8^{\pm 1}/5$ vs $3.4^{\pm 1}/5$ for WANDR. The comparative study finds that $30.2\%$ of the users preferred WANDR motions over AMASS. These findings indicate that the WANDR motions are perceptually close to real motions.

Appendix E Evaluation Distribution

Figure S.4: Overlay of the distribution of training pseudo-goals with the evaluation goals ($\times$) of WANDR in all pairwise combinations of their cylindrical coordinates. Our evaluation goals uniformly cover a range of goals both outside and inside the training distribution.

To better demonstrate that WANDR has been evaluated on out-of-distribution data, in Fig. S.4 we visualize the density of the pseudo-goal training locations (in cylindrical coordinates) and overlay the goal locations (marked as $\times$) used to evaluate WANDR. We clearly observe that most evaluation goals lie on either low probability or unseen locations.

References

  • Aksan et al. [2021] Emre Aksan, Manuel Kaufmann, Peng Cao, and Otmar Hilliges. A spatio-temporal transformer for 3d human motion prediction. In 2021 International Conference on 3D Vision (3DV), pages 565–574. IEEE, 2021.
  • Aliakbarian et al. [2020] Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Lars Petersson, and Stephen Gould. A stochastic conditioning scheme for diverse human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5223–5232, 2020.
  • Andrychowicz et al. [2017] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, 2017.
  • Araújo et al. [2023] Joao Pedro Araújo, Jiaman Li, Karthik Vetrivel, Rishi Agarwal, Jiajun Wu, Deepak Gopinath, Alexander William Clegg, and Karen Liu. Circle: Capture in rich contextual environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21211–21221, 2023.
  • Braun et al. [2024] Jona Braun, Sammy Christen, Muhammed Kocabas, Emre Aksan, and Otmar Hilliges. Physically plausible full-body hand-object interaction synthesis. In International Conference on 3D Vision (3DV), 2024.
  • Cai et al. [2021] Yujun Cai, Yiwei Wang, Yiheng Zhu, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, Chuanxia Zheng, Sijie Yan, Henghui Ding, et al. A unified 3D human motion synthesis model via conditional variational auto-encoder. In Computer Vision and Pattern Recognition (CVPR), pages 11645–11655, 2021.
  • Cao et al. [2020] Zhe Cao, Hang Gao, Karttikeya Mangalam, Qi-Zhi Cai, Minh Vo, and Jitendra Malik. Long-term human motion prediction with scene context. In European Conference on Computer Vision (ECCV), pages 387–404. Springer, 2020.
  • Fan et al. [2023] Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J. Black, and Otmar Hilliges. ARCTIC: A dataset for dexterous bimanual hand-object manipulation. In Computer Vision and Pattern Recognition (CVPR), 2023.
  • Ghosh et al. [2017] Partha Ghosh, Jie Song, Emre Aksan, and Otmar Hilliges. Learning human motion models for long-term predictions. In 2017 International Conference on 3D Vision (3DV), pages 458–466. IEEE, 2017.
  • Hassan et al. [2021] Mohamed Hassan, Duygu Ceylan, Ruben Villegas, Jun Saito, Jimei Yang, Yi Zhou, and Michael Black. Stochastic scene-aware motion prediction. arXiv:2108.08284, 2021.
  • Hassan et al. [2023] Mohamed Hassan, Yunrong Guo, Tingwu Wang, Michael Black, Sanja Fidler, and Xue Bin Peng. Synthesizing physical character-scene interactions. In International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 2023.
  • Hasson et al. [2019] Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. arXiv:1904.05767, 2019.
  • Huang et al. [2022] Chun-Hao P Huang, Hongwei Yi, Markus Höschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, and Michael J Black. Capturing and inferring dense full-body human-scene contact. In Computer Vision and Pattern Recognition (CVPR), pages 13274–13285, 2022.
  • Karunratanakul et al. [2023] Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. Gmd: Controllable human motion synthesis via guided diffusion models. In Computer Vision and Pattern Recognition (CVPR), 2023.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma and Welling [2014] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
  • Ling et al. [2020a] Hung Yu Ling, Fabio Zinno, George Cheng, and Michiel Van De Panne. Character controllers using motion VAEs. Transactions on Graphics (TOG), 2020a.
  • Ling et al. [2020b] Hung Yu Ling, Fabio Zinno, George Cheng, and Michiel Van De Panne. Character controllers using motion VAEs. Transactions on Graphics (TOG), 2020b.
  • Loper et al. [2014] Matthew Loper, Naureen Mahmood, and Michael J Black. Mosh: motion and shape capture from sparse markers. International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 33(6):220–1, 2014.
  • Mahmood et al. [2019] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision (ICCV), 2019.
  • Martinez et al. [2017] Julieta Martinez, Michael J Black, and Javier Romero. On human motion prediction using recurrent neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • Mir et al. [2024] Aymen Mir, Xavier Puig, Angjoo Kanazawa, and Gerard Pons-Moll. Generating continual human motion in diverse 3d scenes. In International Conference on 3D Vision (3DV), 2024.
  • Ormoneit et al. [2000] Dirk Ormoneit, Hedvig Sidenbladh, Michael Black, and Trevor Hastie. Learning and tracking cyclic human motion. Conference on Neural Information Processing Systems (NeurIPS), 13, 2000.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Conference on Neural Information Processing Systems (NeurIPS), 2019.
  • Pavlakos et al. [2019] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019.
  • Pavlovic et al. [2000] Vladimir Pavlovic, James M Rehg, and John MacCormick. Learning switching linear models of human motion. Conference on Neural Information Processing Systems (NeurIPS), 13, 2000.
  • Peng et al. [2021] Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. AMP: Adversarial motion priors for stylized physics-based character control. Transactions on Graphics (TOG), 2021.
  • Petrovich et al. [2021] Mathis Petrovich, Michael J Black, and Gül Varol. Action-conditioned 3D human motion synthesis with transformer VAE. In Computer Vision and Pattern Recognition (CVPR), 2021.
  • Rempe et al. [2021] Davis Rempe, Tolga Birdal, Aaron Hertzmann, Jimei Yang, Srinath Sridhar, and Leonidas J. Guibas. HuMoR: 3D human motion model for robust pose estimation. In International Conference on Computer Vision (ICCV), 2021.
  • Shu et al. [2021] Xiangbo Shu, Liyan Zhang, Guo-Jun Qi, Wei Liu, and Jinhui Tang. Spatiotemporal co-attention recurrent neural networks for human-skeleton motion prediction. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 44(6):3300–3315, 2021.
  • Sidenbladh et al. [2000] Hedvig Sidenbladh, Michael J Black, and David J Fleet. Stochastic tracking of 3d human figures using 2d image motion. In European Conference on Computer Vision (ECCV). Springer, 2000.
  • Sigal et al. [2010] Leonid Sigal, Alexandru O Balan, and Michael J Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision (IJCV), 87(1-2):4–27, 2010.
  • Starke et al. [2019] Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. Neural state machine for character-scene interactions. Transactions on Graphics (TOG), 2019.
  • Taheri et al. [2020] Omid Taheri, Nima Ghorbani, Michael J. Black, and Dimitrios Tzionas. GRAB: A dataset of whole-body human grasping of objects. In European Conference on Computer Vision (ECCV), 2020.
  • Taheri et al. [2022] Omid Taheri, Vasileios Choutas, Michael J. Black, and Dimitrios Tzionas. GOAL: Generating 4D whole-body motion for hand-object grasping. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Taheri et al. [2023] Omid Taheri, Yi Zhou, Dimitrios Tzionas, Yang Zhou, Duygu Ceylan, Soren Pirk, and Michael J. Black. GRIP: Generating interaction poses using latent consistency and spatial cues, 2023.
  • Tevet et al. [2023] Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-Or, and Amit Haim Bermano. Human motion diffusion model. In International Conference on Learning Representations (ICLR), 2023.
  • Trumble et al. [2017] Matthew Trumble, Andrew Gilbert, Charles Malleson, Adrian Hilton, and John Collomosse. Total capture: 3d human pose estimation fusing video and inertial sensors. In British Machine Vision Conference (BMVC), pages 1–13, 2017.
  • Urtasun et al. [2006] Raquel Urtasun, David J Fleet, and Pascal Fua. Temporal motion models for monocular and multiview 3d human body tracking. Computer Vision and Image Understanding (CVIU), 104(2-3):157–177, 2006.
  • Wang et al. [2020a] Qi Wang, Thierry Artières, Mickael Chen, and Ludovic Denoyer. Adversarial learning for modeling human motion. The Visual Computer, 36(1):141–160, 2020a.
  • Wang et al. [2020b] Zhenyi Wang, Ping Yu, Yang Zhao, Ruiyi Zhang, Yufan Zhou, Junsong Yuan, and Changyou Chen. Learning diverse stochastic human-action generators by learning smooth latent transitions. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12281–12288, 2020b.
  • Xu et al. [2023] Liang Xu, Ziyang Song, Dongliang Wang, Jing Su, Zhicheng Fang, Chenjing Ding, Weihao Gan, Yichao Yan, Xin Jin, Xiaokang Yang, et al. ActFormer: A GAN-based transformer towards general action-conditioned 3D human motion generation. In International Conference on Computer Vision (ICCV), pages 2228–2238, 2023.
  • Yu et al. [2020] Ping Yu, Yang Zhao, Chunyuan Li, Junsong Yuan, and Changyou Chen. Structure-aware human-action generation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 18–34. Springer, 2020.
  • Zhang et al. [2022a] Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. MotionDiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001, 2022a.
  • Zhang et al. [2022b] Xiaohan Zhang, Bharat Lal Bhatnagar, Sebastian Starke, Vladimir Guzov, and Gerard Pons-Moll. COUCH: Towards controllable human-chair interactions. In European Conference on Computer Vision (ECCV), pages 518–535. Springer, 2022b.
  • Zhang and Tang [2022] Yan Zhang and Siyu Tang. The wanderings of odysseus in 3D scenes. In Computer Vision and Pattern Recognition (CVPR), 2022.
  • Zhao et al. [2023] Kaifeng Zhao, Yan Zhang, Shaofei Wang, Thabo Beeler, and Siyu Tang. Synthesizing diverse human motions in 3D indoor scenes. In International Conference on Computer Vision (ICCV), 2023.
  • Zheng et al. [2022] Yang Zheng, Yanchao Yang, Kaichun Mo, Jiaman Li, Tao Yu, Yebin Liu, C Karen Liu, and Leonidas J Guibas. GIMO: Gaze-informed human motion prediction in context. In European Conference on Computer Vision (ECCV), pages 676–694. Springer, 2022.
  • Zhou et al. [2019] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Computer Vision and Pattern Recognition (CVPR), pages 5745–5753, 2019.