WANDR: Intention-guided Human Motion Generation

Authors: Markos Diomataris, Nikos Athanasiou, Omid Taheri, Xi Wang, Otmar Hilliges, Michael J. Black

Abstract: Synthesizing natural human motions that enable a 3D human avatar to walk and reach for arbitrary goals in 3D space remains an unsolved problem with many applications. Existing methods (data-driven or using reinforcement learning) are limited in terms of generalization and motion naturalness. A primary obstacle is the scarcity of training data that combines locomotion with goal reaching. To address this, we introduce WANDR, a data-driven model that takes an avatar's initial pose and a goal's 3D position and generates natural human motions that place the end effector (wrist) on the goal location. To solve this, we introduce novel intention features that drive rich goal-oriented movement. Intention guides the agent to the goal, and interactively adapts the generation to novel situations without needing to define sub-goals or the entire motion path. Crucially, intention allows training on datasets that have goal-oriented motions as well as those that do not. WANDR is a conditional Variational Auto-Encoder (c-VAE), which we train using the AMASS and CIRCLE datasets. We evaluate our method extensively and demonstrate its ability to generate natural and long-term motions that reach 3D goals and generalize to unseen goal locations. Our models and code are available for research purposes at wandr.is.tue.mpg.de.

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2308.11617 [pdf, other]

GRIP: Generating Interaction Poses Using Latent Consistency and Spatial Cues

Authors: Omid Taheri, Yi Zhou, Dimitrios Tzionas, Yang Zhou, Duygu Ceylan, Soren Pirk, Michael J. Black

Abstract: Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment. Consequently, modeling realistic hand-object interactions, including the subtle motion of individual fingers, is critical for applications in computer graphics, computer vision, and mixed reality. Prior work on capturing and modeling humans interacting with objects in 3D focuses on the body and object motion, often ignoring hand pose. In contrast, we introduce GRIP, a learning-based method that takes, as input, the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction. As a preliminary step before synthesizing the hand motion, we first use a network, ANet, to denoise the arm motion. Then, we leverage the spatio-temporal relationship between the body and the object to extract two types of novel temporal interaction cues, and use them in a two-stage inference pipeline to generate the hand motion. In the first stage, we introduce a new approach to enforce motion temporal consistency in the latent space (LTC), and generate consistent interaction motions. In the second stage, GRIP generates refined hand poses to avoid hand-object penetrations. Given sequences of noisy body and object motion, GRIP upgrades them to include hand-object interaction. Quantitative experiments and perceptual studies demonstrate that GRIP outperforms baseline methods and generalizes to unseen objects and motions from different motion-capture datasets.

Submitted 22 August, 2023; originally announced August 2023.

Comments: The project has been started during Omid Taheri's internship at Adobe and as a collaboration with the Max Planck Institute for Intelligent Systems

arXiv:2303.18246 [pdf, other]

3D Human Pose Estimation via Intuitive Physics

Authors: Shashank Tripathi, Lea Müller, Chun-Hao P. Huang, Omid Taheri, Michael J. Black, Dimitrios Tzionas

Abstract: Estimating 3D humans from images often produces implausible bodies that lean, float, or penetrate the floor. Such methods ignore the fact that bodies are typically supported by the scene. A physics engine can be used to enforce physical plausibility, but these are not differentiable, rely on unrealistic proxy bodies, and are difficult to integrate into existing optimization and learning frameworks. In contrast, we exploit novel intuitive-physics (IP) terms that can be inferred from a 3D SMPL body interacting with the scene. Inspired by biomechanics, we infer the pressure heatmap on the body, the Center of Pressure (CoP) from the heatmap, and the SMPL body's Center of Mass (CoM). With these, we develop IPMAN, to estimate a 3D body from a color image in a "stable" configuration by encouraging plausible floor contact and overlapping CoP and CoM. Our IP terms are intuitive, easy to implement, fast to compute, differentiable, and can be integrated into existing optimization and regression methods. We evaluate IPMAN on standard datasets and MoYo, a new dataset with synchronized multi-view images, ground-truth 3D bodies with complex poses, body-floor contact, CoM and pressure. IPMAN produces more plausible results than the state of the art, improving accuracy for static poses, while not hurting dynamic ones. Code and data are available for research at https://ipman.is.tue.mpg.de.

Submitted 24 July, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

Comments: Accepted in CVPR'23. Project page: https://ipman.is.tue.mpg.de

arXiv:2204.13662 [pdf, other]

ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation

Authors: Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J. Black, Otmar Hilliges

Abstract: Humans intuitively understand that inanimate objects do not move by themselves, but that state changes are typically caused by human manipulation (e.g., the opening of a book). This is not yet the case for machines. In part this is because there exist no datasets with ground-truth 3D annotations for the study of physically consistent and synchronised motion of hands and articulated objects. To this end, we introduce ARCTIC -- a dataset of two hands that dexterously manipulate objects, containing 2.1M video frames paired with accurate 3D hand and object meshes and detailed, dynamic contact information. It contains bi-manual articulation of objects such as scissors or laptops, where hand poses and object states evolve jointly in time. We propose two novel articulated hand-object interaction tasks: (1) Consistent motion reconstruction: Given a monocular video, the goal is to reconstruct two hands and articulated objects in 3D, so that their motions are spatio-temporally consistent. (2) Interaction field estimation: Dense relative hand-object distances must be estimated from images. We introduce two baselines ArcticNet and InterField, respectively and evaluate them qualitatively and quantitatively on ARCTIC. Our code and data are available at https://arctic.is.tue.mpg.de.

Submitted 23 April, 2023; v1 submitted 28 April, 2022; originally announced April 2022.

Comments: Project page: https://arctic.is.tue.mpg.de

arXiv:2112.11454 [pdf, other]

GOAL: Generating 4D Whole-Body Motion for Hand-Object Grasping

Authors: Omid Taheri, Vasileios Choutas, Michael J. Black, Dimitrios Tzionas

Abstract: Generating digital humans that move realistically has many applications and is widely studied, but existing methods focus on the major limbs of the body, ignoring the hands and head. Hands have been separately studied, but the focus has been on generating realistic static grasps of objects. To synthesize virtual characters that interact with the world, we need to generate full-body motions and realistic hand grasps simultaneously. Both sub-problems are challenging on their own and, together, the state-space of poses is significantly larger, the scales of hand and body motions differ, and the whole-body posture and the hand grasp must agree, satisfy physical constraints, and be plausible. Additionally, the head is involved because the avatar must look at the object to interact with it. For the first time, we address the problem of generating full-body, hand and head motions of an avatar grasping an unknown object. As input, our method, called GOAL, takes a 3D object, its position, and a starting 3D body pose and shape. GOAL outputs a sequence of whole-body poses using two novel networks. First, GNet generates a goal whole-body grasp with a realistic body, head, arm, and hand pose, as well as hand-object contact. Second, MNet generates the motion between the starting and goal pose. This is challenging, as it requires the avatar to walk towards the object with foot-ground contact, orient the head towards it, reach out, and grasp it with a realistic hand pose and hand-object contact. To achieve this, the networks exploit a representation that combines SMPL-X body parameters and 3D vertex offsets. We train and evaluate GOAL, both qualitatively and quantitatively, on the GRAB dataset. Results show that GOAL generalizes well to unseen objects, outperforming baselines. GOAL takes a step towards synthesizing realistic full-body object grasping.

Submitted 16 March, 2023; v1 submitted 21 December, 2021; originally announced December 2021.

arXiv:2011.00574 [pdf]

Human Leg Motion Tracking by Fusing IMUs and RGB Camera Data Using Extended Kalman Filter

Authors: Omid Taheri, Hassan Salarieh, Aria Alasty

Abstract: Human motion capture is frequently used to study rehabilitation and clinical problems, as well as to provide realistic animation for the entertainment industry. IMU-based systems, as well as Marker-based motion tracking systems, are the most popular methods to track movement due to their low cost of implementation and lightweight. This paper proposes a quaternion-based Extended Kalman filter approach to recover the human leg segments motions with a set of IMU sensors data fused with camera-marker system data. In this paper, an Extended Kalman Filter approach is developed to fuse the data of two IMUs and one RGB camera for human leg motion tracking. Based on the complementary properties of the inertial sensors and camera-marker system, in the introduced new measurement model, the orientation data of the upper leg and the lower leg is updated through three measurement equations. The positioning of the human body is made possible by the tracked position of the pelvis joint by the camera marker system. A mathematical model has been utilized to estimate joints' depth in 2D images. The efficiency of the proposed algorithm is evaluated by an optical motion tracker system.

Submitted 7 December, 2020; v1 submitted 1 November, 2020; originally announced November 2020.

Comments: This paper results from O. Taheri's MSc Thesis (2017) at the Sharif University of Technology

arXiv:2008.11200 [pdf, other]

doi 10.1007/978-3-030-58548-8_34

GRAB: A Dataset of Whole-Body Human Grasping of Objects

Authors: Omid Taheri, Nima Ghorbani, Michael J. Black, Dimitrios Tzionas

Abstract: Training computers to understand, model, and synthesize human grasping requires a rich dataset containing complex 3D object shapes, detailed contact information, hand pose and shape, and the 3D body motion over time. While "grasping" is commonly thought of as a single hand stably lifting an object, we capture the motion of the entire body and adopt the generalized notion of "whole-body grasps". Thus, we collect a new dataset, called GRAB (GRasping Actions with Bodies), of whole-body grasps, containing full 3D shape and pose sequences of 10 subjects interacting with 51 everyday objects of varying shape and size. Given MoCap markers, we fit the full 3D body shape and pose, including the articulated face and hands, as well as the 3D object pose. This gives detailed 3D meshes over time, from which we compute contact between the body and object. This is a unique dataset, that goes well beyond existing ones for modeling and understanding how humans grasp and manipulate objects, how their full body is involved, and how interaction varies with the task. We illustrate the practical value of GRAB with an example application; we train GrabNet, a conditional generative network, to predict 3D hand grasps for unseen 3D object shapes. The dataset and code are available for research purposes at https://grab.is.tue.mpg.de.

Submitted 25 August, 2020; originally announced August 2020.

Comments: ECCV 2020

arXiv:1401.3566 [pdf, ps, other]

Reweighted l1-norm Penalized LMS for Sparse Channel Estimation and Its Analysis

Authors: Omid Taheri, Sergiy A. Vorobyov

Abstract: A new reweighted l1-norm penalized least mean square (LMS) algorithm for sparse channel estimation is proposed and studied in this paper. Since standard LMS algorithm does not take into account the sparsity information about the channel impulse response (CIR), sparsity-aware modifications of the LMS algorithm aim at outperforming the standard LMS by introducing a penalty term to the standard LMS cost function which forces the solution to be sparse. Our reweighted l1-norm penalized LMS algorithm introduces in addition a reweighting of the CIR coefficient estimates to promote a sparse solution even more and approximate l0-pseudo-norm closer. We provide in depth quantitative analysis of the reweighted l1-norm penalized LMS algorithm. An expression for the excess mean square error (MSE) of the algorithm is also derived which suggests that under the right conditions, the reweighted l1-norm penalized LMS algorithm outperforms the standard LMS, which is expected. However, our quantitative analysis also answers the question of what is the maximum sparsity level in the channel for which the reweighted l1-norm penalized LMS algorithm is better than the standard LMS. Simulation results showing the better performance of the reweighted l1-norm penalized LMS algorithm compared to other existing LMS-type algorithms are given.

Submitted 15 January, 2014; originally announced January 2014.

Comments: 28 pages, 4 figures, 1 table, Submitted to Signal Processing on June 2013

Journal ref: O. Taheri and S.A. Vorobyov, "Reweighted l1-norm penalized LMS for sparse channel estimation and its analysis," Signal Processing, vol. 104, pp. 70-79, Nov. 2014

arXiv:1305.4980 [pdf, ps, other]

doi 10.1109/TSP.2013.2284762

Permutation Meets Parallel Compressed Sensing: How to Relax Restricted Isometry Property for 2D Sparse Signals

Authors: Hao Fang, Sergiy A. Vorobyov, Hai Jiang, Omid Taheri

Abstract: Traditional compressed sensing considers sampling a 1D signal. For a multidimensional signal, if reshaped into a vector, the required size of the sensing matrix becomes dramatically large, which increases the storage and computational complexity significantly. To solve this problem, we propose to reshape the multidimensional signal into a 2D signal and sample the 2D signal using compressed sensing column by column with the same sensing matrix. It is referred to as parallel compressed sensing, and it has much lower storage and computational complexity. For a given reconstruction performance of parallel compressed sensing, if a so-called acceptable permutation is applied to the 2D signal, we show that the corresponding sensing matrix has a smaller required order of restricted isometry property condition, and thus, storage and computation requirements are further lowered. A zigzag-scan-based permutation, which is shown to be particularly useful for signals satisfying a layer model, is introduced and investigated. As an application of the parallel compressed sensing with the zigzag-scan-based permutation, a video compression scheme is presented. It is shown that the zigzag-scan-based permutation increases the peak signal-to-noise ratio of reconstructed images and video frames.

Submitted 21 May, 2013; originally announced May 2013.

Comments: 30 pages, 10 figures, 3 tables, submitted to the IEEE Trans. Signal Processing in November 2012

Journal ref: H. Fang, S.A. Vorobyov, H. Jiang, O. Taheri, "Permutation meets parallel compressed sensing. How to relax RIP for 2D sparse signals," IEEE Trans. Signal Processing, vol. 62, no. 1, pp. 196-210, Jan. 2014

arXiv:1004.4308 [pdf, ps, other]

doi 10.1109/TSP.2010.2091411

Segmented compressed sampling for analog-to-information conversion: Method and performance analysis

Authors: Omid Taheri, Sergiy A. Vorobyov

Abstract: A new segmented compressed sampling method for analog-to-information conversion (AIC) is proposed. An analog signal measured by a number of parallel branches of mixers and integrators (BMIs), each characterized by a specific random sampling waveform, is first segmented in time into $M$ segments. Then the sub-samples collected on different segments and different BMIs are reused so that a larger number of samples than the number of BMIs is collected. This technique is shown to be equivalent to extending the measurement matrix, which consists of the BMI sampling waveforms, by adding new rows without actually increasing the number of BMIs. We prove that the extended measurement matrix satisfies the restricted isometry property with overwhelming probability if the original measurement matrix of BMI sampling waveforms satisfies it. We also show that the signal recovery performance can be improved significantly if our segmented AIC is used for sampling instead of the conventional AIC. Simulation results verify the effectiveness of the proposed segmented compressed sampling method and the validity of our theoretical studies.

Submitted 24 April, 2010; originally announced April 2010.

Comments: 32 pages, 5 figures, submitted to the IEEE Transactions on Signal Processing in April 2010

Journal ref: O. Taheri and S.A. Vorobyov, "Segmented compressed sampling for analog-to-information conversion: Method and performance analysis," IEEE Trans. Signal Processing, vol. 59, no. 2, pp. 554-572, Feb. 2011

Showing 1–10 of 10 results for author: Taheri, O