Search | arXiv e-print repository

arXiv:2406.19811 [pdf, ps, other]

EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting

Authors: Daiwei Zhang, Gengyan Li, Jiajie Li, Mickaël Bressieux, Otmar Hilliges, Marc Pollefeys, Luc Van Gool, Xi Wang

Abstract: Human activities are inherently complex, and even simple household tasks involve numerous object interactions. To better understand these activities and behaviors, it is crucial to model their dynamic interactions with the environment. The recent availability of affordable head-mounted cameras and egocentric data offers a more accessible and efficient means to understand dynamic human-object inter… ▽ More Human activities are inherently complex, and even simple household tasks involve numerous object interactions. To better understand these activities and behaviors, it is crucial to model their dynamic interactions with the environment. The recent availability of affordable head-mounted cameras and egocentric data offers a more accessible and efficient means to understand dynamic human-object interactions in 3D environments. However, most existing methods for human activity modeling either focus on reconstructing 3D models of hand-object or human-scene interactions or on map** 3D scenes, neglecting dynamic interactions with objects. The few existing solutions often require inputs from multiple sources, including multi-camera setups, depth-sensing cameras, or kinesthetic sensors. To this end, we introduce EgoGaussian, the first method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone. We leverage the uniquely discrete nature of Gaussian Splatting and segment dynamic interactions from the background. Our approach employs a clip-level online learning pipeline that leverages the dynamic nature of human activities, allowing us to reconstruct the temporal evolution of the scene in chronological order and track rigid object motion. Additionally, our method automatically segments object and background Gaussians, providing 3D representations for both static scenes and dynamic objects. EgoGaussian outperforms previous NeRF and Dynamic Gaussian methods in challenging in-the-wild videos and we also qualitatively demonstrate the high quality of the reconstructed models. △ Less

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.08472 [pdf, other]

RILe: Reinforced Imitation Learning

Authors: Mert Albaba, Sammy Christen, Christoph Gebhardt, Thomas Langarek, Michael J. Black, Otmar Hilliges

Abstract: Reinforcement Learning has achieved significant success in generating complex behavior but often requires extensive reward function engineering. Adversarial variants of Imitation Learning and Inverse Reinforcement Learning offer an alternative by learning policies from expert demonstrations via a discriminator. Employing discriminators increases their data- and computational efficiency over the st… ▽ More Reinforcement Learning has achieved significant success in generating complex behavior but often requires extensive reward function engineering. Adversarial variants of Imitation Learning and Inverse Reinforcement Learning offer an alternative by learning policies from expert demonstrations via a discriminator. Employing discriminators increases their data- and computational efficiency over the standard approaches; however, results in sensitivity to imperfections in expert data. We propose RILe, a teacher-student system that achieves both robustness to imperfect data and efficiency. In RILe, the student learns an action policy while the teacher dynamically adjusts a reward function based on the student's performance and its alignment with expert demonstrations. By tailoring the reward function to both performance of the student and expert similarity, our system reduces dependence on the discriminator and, hence, increases robustness against data imperfections. Experiments show that RILe outperforms existing methods by 2x in settings with limited or noisy expert data. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.01595 [pdf, other]

MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

Authors: Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, Jie Song

Abstract: We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos. Reconstructing multiple individuals moving and interacting naturally from monocular in-the-wild videos poses a challenging task. Addressing it necessitates precise pixel-level disentanglement of individuals without any prior knowledge about the subjects. Moreover, it requires recovering i… ▽ More We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos. Reconstructing multiple individuals moving and interacting naturally from monocular in-the-wild videos poses a challenging task. Addressing it necessitates precise pixel-level disentanglement of individuals without any prior knowledge about the subjects. Moreover, it requires recovering intricate and complete 3D human shapes from short video sequences, intensifying the level of difficulty. To tackle these challenges, we first define a layered neural representation for the entire scene, composited by individual human and background models. We learn the layered neural representation from videos via our layer-wise differentiable volume rendering. This learning process is further enhanced by our hybrid instance segmentation approach which combines the self-supervised 3D segmentation and the promptable 2D segmentation module, yielding reliable instance segmentation supervision even under close human interaction. A confidence-guided optimization formulation is introduced to optimize the human poses and shape/appearance alternately. We incorporate effective objectives to refine human poses via photometric information and impose physically plausible constraints on human dynamics, leading to temporally consistent 3D reconstructions with high fidelity. The evaluation of our method shows the superiority over prior art on publicly available datasets and in-the-wild videos. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: Project page: https://eth-ait.github.io/MultiPly/

arXiv:2405.14477 [pdf, other]

LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models

Authors: Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, Romann M. Weber

Abstract: Advances in latent diffusion models (LDMs) have revolutionized high-resolution image generation, but the design space of the autoencoder that is central to these systems remains underexplored. In this paper, we introduce LiteVAE, a family of autoencoders for LDMs that leverage the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencode… ▽ More Advances in latent diffusion models (LDMs) have revolutionized high-resolution image generation, but the design space of the autoencoder that is central to these systems remains underexplored. In this paper, we introduce LiteVAE, a family of autoencoders for LDMs that leverage the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality. We also investigate the training methodologies and the decoder architecture of LiteVAE and propose several enhancements that improve the training dynamics and reconstruction quality. Our base LiteVAE model matches the quality of the established VAEs in current LDMs with a six-fold reduction in encoder parameters, leading to faster training and lower GPU memory requirements, while our larger model outperforms VAEs of comparable complexity across all evaluated metrics (rFID, LPIPS, PSNR, and SSIM). △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.09522 [pdf, other]

doi 10.1145/3641519.3657408

ContourCraft: Learning to Resolve Intersections in Neural Multi-Garment Simulations

Authors: Artur Grigorev, Giorgio Becherini, Michael J. Black, Otmar Hilliges, Bernhard Thomaszewski

Abstract: Learning-based approaches to cloth simulation have started to show their potential in recent years. However, handling collisions and intersections in neural simulations remains a largely unsolved problem. In this work, we present \moniker{}, a learning-based solution for handling intersections in neural cloth simulations. Unlike conventional approaches that critically rely on intersection-free inp… ▽ More Learning-based approaches to cloth simulation have started to show their potential in recent years. However, handling collisions and intersections in neural simulations remains a largely unsolved problem. In this work, we present \moniker{}, a learning-based solution for handling intersections in neural cloth simulations. Unlike conventional approaches that critically rely on intersection-free inputs, \moniker{} robustly recovers from intersections introduced through missed collisions, self-penetrating bodies, or errors in manually designed multi-layer outfits. The technical core of \moniker{} is a novel intersection contour loss that penalizes interpenetrations and encourages rapid resolution thereof. We integrate our intersection loss with a collision-avoiding repulsion objective into a neural cloth simulation method based on graph neural networks (GNNs). We demonstrate our method's ability across a challenging set of diverse multi-layer outfits under dynamic human motions. Our extensive analysis indicates that \moniker{} significantly improves collision handling for learned simulation and produces visually compelling results. △ Less

Submitted 24 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

Comments: Accepted for publication by SIGGRAPH 2024, conference track

arXiv:2404.18630 [pdf, other]

4D-DRESS: A 4D Dataset of Real-world Human Clothing with Semantic Annotations

Authors: Wenbo Wang, Hsuan-I Ho, Chen Guo, Boxiang Rong, Artur Grigorev, Jie Song, Juan Jose Zarate, Otmar Hilliges

Abstract: The studies of human clothing for digital avatars have predominantly relied on synthetic datasets. While easy to collect, synthetic data often fall short in realism and fail to capture authentic clothing dynamics. Addressing this gap, we introduce 4D-DRESS, the first real-world 4D dataset advancing human clothing research with its high-quality 4D textured scans and garment meshes. 4D-DRESS capture… ▽ More The studies of human clothing for digital avatars have predominantly relied on synthetic datasets. While easy to collect, synthetic data often fall short in realism and fail to capture authentic clothing dynamics. Addressing this gap, we introduce 4D-DRESS, the first real-world 4D dataset advancing human clothing research with its high-quality 4D textured scans and garment meshes. 4D-DRESS captures 64 outfits in 520 human motion sequences, amounting to 78k textured scans. Creating a real-world clothing dataset is challenging, particularly in annotating and segmenting the extensive and complex 4D human scans. To address this, we develop a semi-automatic 4D human parsing pipeline. We efficiently combine a human-in-the-loop process with automation to accurately label 4D scans in diverse garments and body movements. Leveraging precise annotations and high-quality garment meshes, we establish several benchmarks for clothing simulation and reconstruction. 4D-DRESS offers realistic and challenging data that complements synthetic sources, paving the way for advancements in research of lifelike human clothing. Website: https://ait.ethz.ch/4d-dress. △ Less

Submitted 29 April, 2024; originally announced April 2024.

Comments: CVPR 2024 paper, 21 figures, 9 tables

arXiv:2404.15383 [pdf, other]

WANDR: Intention-guided Human Motion Generation

Authors: Markos Diomataris, Nikos Athanasiou, Omid Taheri, Xi Wang, Otmar Hilliges, Michael J. Black

Abstract: Synthesizing natural human motions that enable a 3D human avatar to walk and reach for arbitrary goals in 3D space remains an unsolved problem with many applications. Existing methods (data-driven or using reinforcement learning) are limited in terms of generalization and motion naturalness. A primary obstacle is the scarcity of training data that combines locomotion with goal reaching. To address… ▽ More Synthesizing natural human motions that enable a 3D human avatar to walk and reach for arbitrary goals in 3D space remains an unsolved problem with many applications. Existing methods (data-driven or using reinforcement learning) are limited in terms of generalization and motion naturalness. A primary obstacle is the scarcity of training data that combines locomotion with goal reaching. To address this, we introduce WANDR, a data-driven model that takes an avatar's initial pose and a goal's 3D position and generates natural human motions that place the end effector (wrist) on the goal location. To solve this, we introduce novel intention features that drive rich goal-oriented movement. Intention guides the agent to the goal, and interactively adapts the generation to novel situations without needing to define sub-goals or the entire motion path. Crucially, intention allows training on datasets that have goal-oriented motions as well as those that do not. WANDR is a conditional Variational Auto-Encoder (c-VAE), which we train using the AMASS and CIRCLE datasets. We evaluate our method extensively and demonstrate its ability to generate natural and long-term motions that reach 3D goals and generalize to unseen goal locations. Our models and code are available for research purposes at wandr.is.tue.mpg.de. △ Less

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2403.19649 [pdf, other]

GraspXL: Generating Gras** Motions for Diverse Objects at Scale

Authors: Hui Zhang, Sammy Christen, Zicong Fan, Otmar Hilliges, Jie Song

Abstract: Human hands possess the dexterity to interact with diverse objects such as gras** specific parts of the objects and/or approaching them from desired directions. More importantly, humans can grasp objects of any shape without object-specific skills. Recent works synthesize gras** motions following single objectives such as a desired approach heading direction or a gras** area. Moreover, they… ▽ More Human hands possess the dexterity to interact with diverse objects such as gras** specific parts of the objects and/or approaching them from desired directions. More importantly, humans can grasp objects of any shape without object-specific skills. Recent works synthesize gras** motions following single objectives such as a desired approach heading direction or a gras** area. Moreover, they usually rely on expensive 3D hand-object data during training and inference, which limits their capability to synthesize gras** motions for unseen objects at scale. In this paper, we unify the generation of hand-object gras** motions across multiple motion objectives, diverse object shapes and dexterous hand morphologies in a policy learning framework GraspXL. The objectives are composed of the graspable area, heading direction during approach, wrist rotation, and hand position. Without requiring any 3D hand-object interaction data, our policy trained with 58 objects can robustly synthesize diverse gras** motions for more than 500k unseen objects with a success rate of 82.2%. At the same time, the policy adheres to objectives, which enables the generation of diverse grasps per object. Moreover, we show that our framework can be deployed to different dexterous hands and work with reconstructed or generated objects. We quantitatively and qualitatively evaluate our method to show the efficacy of our approach. Our model and code will be available. △ Less

Submitted 28 March, 2024; originally announced March 2024.

Comments: Project Page: https://eth-ait.github.io/graspxl/

arXiv:2403.16428 [pdf, other]

Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects

Authors: Zicong Fan, Takehiko Ohkawa, Linlin Yang, Nie Lin, Zhishan Zhou, Shihao Zhou, Jiajun Liang, Zhong Gao, Xuanyang Zhang, Xue Zhang, Fei Li, Liu Zheng, Feng Lu, Karim Abou Zeid, Bastian Leibe, Jeongwan On, Seungryul Baek, Aditya Prakash, Saurabh Gupta, Kun He, Yoichi Sato, Otmar Hilliges, Hyung ** Chang, Angela Yao

Abstract: We interact with the world with our hands and see it through our own (egocentric) perspective. A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation. Accurately reconstructing such interactions in 3D is challenging due to heavy occlusion, viewpoint bias, camera distortion, and motion blur from the… ▽ More We interact with the world with our hands and see it through our own (egocentric) perspective. A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation. Accurately reconstructing such interactions in 3D is challenging due to heavy occlusion, viewpoint bias, camera distortion, and motion blur from the head movement. To this end, we designed the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits. Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks. Our analysis demonstrates the effectiveness of addressing distortion specific to egocentric cameras, adopting high-capacity transformers to learn complex hand-object interactions, and fusing predictions from different views. Our study further reveals challenging scenarios intractable with state-of-the-art methods, such as fast hand motion, object reconstruction from narrow egocentric views, and close contact between two hands and objects. Our efforts will enrich the community's knowledge foundation and facilitate future hand studies on egocentric hand-object interactions. △ Less

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2401.04143 [pdf, other]

RHOBIN Challenge: Reconstruction of Human Object Interaction

Authors: Xianghui Xie, Xi Wang, Nikos Athanasiou, Bharat Lal Bhatnagar, Chun-Hao P. Huang, Kaichun Mo, Hao Chen, Xia Jia, Zerui Zhang, Liangxian Cui, Xiao Lin, Bingqiao Qian, Jie Xiao, Wenfei Yang, Hyeong** Nam, Daniel Sungho Jung, Kihoon Kim, Kyoung Mu Lee, Otmar Hilliges, Gerard Pons-Moll

Abstract: Modeling the interaction between humans and objects has been an emerging research direction in recent years. Capturing human-object interaction is however a very challenging task due to heavy occlusion and complex dynamics, which requires understanding not only 3D human pose, and object pose but also the interaction between them. Reconstruction of 3D humans and objects has been two separate resear… ▽ More Modeling the interaction between humans and objects has been an emerging research direction in recent years. Capturing human-object interaction is however a very challenging task due to heavy occlusion and complex dynamics, which requires understanding not only 3D human pose, and object pose but also the interaction between them. Reconstruction of 3D humans and objects has been two separate research fields in computer vision for a long time. We hence proposed the first RHOBIN challenge: reconstruction of human-object interactions in conjunction with the RHOBIN workshop. It was aimed at bringing the research communities of human and object reconstruction as well as interaction modeling together to discuss techniques and exchange ideas. Our challenge consists of three tracks of 3D reconstruction from monocular RGB images with a focus on dealing with challenging interaction scenarios. Our challenge attracted more than 100 participants with more than 300 submissions, indicating the broad interest in the research communities. This paper describes the settings of our challenge and discusses the winning methods of each track in more detail. We observe that the human reconstruction task is becoming mature even under heavy occlusion settings while object pose estimation and joint reconstruction remain challenging tasks. With the growing interest in interaction modeling, we hope this report can provide useful insights and foster future research in this direction. Our workshop website can be found at \href{https://rhobin-challenge.github.io/}{https://rhobin-challenge.github.io/}. △ Less

Submitted 7 January, 2024; originally announced January 2024.

Comments: 14 pages, 5 tables, 7 figure. Technical report of the CVPR'23 workshop: RHOBIN challenge (https://rhobin-challenge.github.io/)

arXiv:2312.11666 [pdf, other]

HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles

Authors: Vanessa Sklyarova, Egor Zakharov, Otmar Hilliges, Michael J. Black, Justus Thies

Abstract: We present HAAR, a new strand-based generative model for 3D human hairstyles. Specifically, based on textual inputs, HAAR produces 3D hairstyles that could be used as production-level assets in modern computer graphics engines. Current AI-based generative models take advantage of powerful 2D priors to reconstruct 3D content in the form of point clouds, meshes, or volumetric functions. However, by… ▽ More We present HAAR, a new strand-based generative model for 3D human hairstyles. Specifically, based on textual inputs, HAAR produces 3D hairstyles that could be used as production-level assets in modern computer graphics engines. Current AI-based generative models take advantage of powerful 2D priors to reconstruct 3D content in the form of point clouds, meshes, or volumetric functions. However, by using the 2D priors, they are intrinsically limited to only recovering the visual parts. Highly occluded hair structures can not be reconstructed with those methods, and they only model the ''outer shell'', which is not ready to be used in physics-based rendering or simulation pipelines. In contrast, we propose a first text-guided generative method that uses 3D hair strands as an underlying representation. Leveraging 2D visual question-answering (VQA) systems, we automatically annotate synthetic hair models that are generated from a small set of artist-created hairstyles. This allows us to train a latent diffusion model that operates in a common hairstyle UV space. In qualitative and quantitative studies, we demonstrate the capabilities of the proposed model and compare it to existing hairstyle generation approaches. △ Less

Submitted 18 December, 2023; originally announced December 2023.

Comments: For more results please refer to the project page https://haar.is.tue.mpg.de/

arXiv:2312.08558 [pdf, other]

G-MEMP: Gaze-Enhanced Multimodal Ego-Motion Prediction in Driving

Authors: M. Eren Akbiyik, Nedko Savov, Danda Pani Paudel, Nikola Popovic, Christian Vater, Otmar Hilliges, Luc Van Gool, Xi Wang

Abstract: Understanding the decision-making process of drivers is one of the keys to ensuring road safety. While the driver intent and the resulting ego-motion trajectory are valuable in develo** driver-assistance systems, existing methods mostly focus on the motions of other vehicles. In contrast, we focus on inferring the ego trajectory of a driver's vehicle using their gaze data. For this purpose, we f… ▽ More Understanding the decision-making process of drivers is one of the keys to ensuring road safety. While the driver intent and the resulting ego-motion trajectory are valuable in develo** driver-assistance systems, existing methods mostly focus on the motions of other vehicles. In contrast, we focus on inferring the ego trajectory of a driver's vehicle using their gaze data. For this purpose, we first collect a new dataset, GEM, which contains high-fidelity ego-motion videos paired with drivers' eye-tracking data and GPS coordinates. Next, we develop G-MEMP, a novel multimodal ego-trajectory prediction network that combines GPS and video input with gaze data. We also propose a new metric called Path Complexity Index (PCI) to measure the trajectory complexity. We perform extensive evaluations of the proposed method on both GEM and DR(eye)VE, an existing benchmark dataset. The results show that G-MEMP significantly outperforms state-of-the-art methods in both benchmarks. Furthermore, ablation studies demonstrate over 20% improvement in average displacement using gaze data, particularly in challenging driving scenarios with a high PCI. The data, code, and models can be found at https://eth-ait.github.io/g-memp/. △ Less

Submitted 13 December, 2023; originally announced December 2023.

arXiv:2311.18448 [pdf, other]

HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

Authors: Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J. Black, Otmar Hilliges

Abstract: Since humans interact with diverse objects every day, the holistic 3D capture of these interactions is important to understand and model human behaviour. However, most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or heavily rely on limited 3D hand-object data, restricting their ability to scale and generalize to more unconstrained interaction… ▽ More Since humans interact with diverse objects every day, the holistic 3D capture of these interactions is important to understand and model human behaviour. However, most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or heavily rely on limited 3D hand-object data, restricting their ability to scale and generalize to more unconstrained interaction settings. To this end, we introduce HOLD -- the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video. We develop a compositional articulated implicit model that can reconstruct disentangled 3D hand and object from 2D images. We also further incorporate hand-object constraints to improve hand-object poses and consequently the reconstruction quality. Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings. Moreover, we qualitatively show its robustness in reconstructing from in-the-wild videos. Code: https://github.com/zc-alexfan/hold △ Less

Submitted 30 November, 2023; originally announced November 2023.

arXiv:2311.17944 [pdf, other]

LALM: Long-Term Action Anticipation with Language Models

Authors: Sanghwan Kim, Daoji Huang, Yongqin Xian, Otmar Hilliges, Luc Van Gool, Xi Wang

Abstract: Understanding human activity is a crucial yet intricate task in egocentric vision, a field that focuses on capturing visual perspectives from the camera wearer's viewpoint. While traditional methods heavily rely on representation learning trained on extensive video data, there exists a significant limitation: obtaining effective video representations proves challenging due to the inherent complexi… ▽ More Understanding human activity is a crucial yet intricate task in egocentric vision, a field that focuses on capturing visual perspectives from the camera wearer's viewpoint. While traditional methods heavily rely on representation learning trained on extensive video data, there exists a significant limitation: obtaining effective video representations proves challenging due to the inherent complexity and variability in human activities.Furthermore, exclusive dependence on video-based learning may constrain a model's capability to generalize across long-tail classes and out-of-distribution scenarios. In this study, we introduce a novel approach for long-term action anticipation using language models (LALM), adept at addressing the complex challenges of long-term activity understanding without the need for extensive training. Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details. By leveraging the context provided by these past events, we devise a prompting strategy for action anticipation using large language models (LLMs). Moreover, we implement Maximal Marginal Relevance for example selection to facilitate in-context learning of the LLMs. Our experimental results demonstrate that LALM surpasses the state-of-the-art methods in the task of long-term action anticipation on the Ego4D benchmark. We further validate LALM on two additional benchmarks, affirming its capacity for generalization across intricate activities with different sets of taxonomies. These are achieved without specific fine-tuning. △ Less

Submitted 28 November, 2023; originally announced November 2023.

arXiv:2311.16854 [pdf, other]

A Unified Approach for Text- and Image-guided 4D Scene Generation

Authors: Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Karsten Kreis, Otmar Hilliges, Shalini De Mello

Abstract: Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion gui… ▽ More Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study, we demonstrate that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation, Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images, without the need to modify the motion learning stage. Thus, our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks. △ Less

Submitted 7 May, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

Comments: Project page: https://research.nvidia.com/labs/nxp/dream-in-4d/

arXiv:2311.15855 [pdf, other]

SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion

Authors: Hsuan-I Ho, Jie Song, Otmar Hilliges

Abstract: A long-standing goal of 3D human reconstruction is to create lifelike and fully detailed 3D humans from single-view images. The main challenge lies in inferring unknown body shapes, appearances, and clothing details in areas not visible in the images. To address this, we propose SiTH, a novel pipeline that uniquely integrates an image-conditioned diffusion model into a 3D mesh reconstruction workf… ▽ More A long-standing goal of 3D human reconstruction is to create lifelike and fully detailed 3D humans from single-view images. The main challenge lies in inferring unknown body shapes, appearances, and clothing details in areas not visible in the images. To address this, we propose SiTH, a novel pipeline that uniquely integrates an image-conditioned diffusion model into a 3D mesh reconstruction workflow. At the core of our method lies the decomposition of the challenging single-view reconstruction problem into generative hallucination and reconstruction subproblems. For the former, we employ a powerful generative diffusion model to hallucinate unseen back-view appearance based on the input images. For the latter, we leverage skinned body meshes as guidance to recover full-body texture meshes from the input and back-view images. SiTH requires as few as 500 3D human scans for training while maintaining its generality and robustness to diverse images. Extensive evaluations on two 3D human benchmarks, including our newly created one, highlighted our method's superior accuracy and perceptual quality in 3D textured human reconstruction. Our code and evaluation benchmark are available at https://ait.ethz.ch/sith △ Less

Submitted 30 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

Comments: 23 pages, 23 figures, CVPR 2024

arXiv:2311.05599 [pdf, other]

SynH2R: Synthesizing Hand-Object Motions for Learning Human-to-Robot Handovers

Authors: Sammy Christen, Lan Feng, Wei Yang, Yu-Wei Chao, Otmar Hilliges, Jie Song

Abstract: Vision-based human-to-robot handover is an important and challenging task in human-robot interaction. Recent work has attempted to train robot policies by interacting with dynamic virtual humans in simulated environments, where the policies can later be transferred to the real world. However, a major bottleneck is the reliance on human motion capture data, which is expensive to acquire and difficu… ▽ More Vision-based human-to-robot handover is an important and challenging task in human-robot interaction. Recent work has attempted to train robot policies by interacting with dynamic virtual humans in simulated environments, where the policies can later be transferred to the real world. However, a major bottleneck is the reliance on human motion capture data, which is expensive to acquire and difficult to scale to arbitrary objects and human gras** motions. In this paper, we introduce a framework that can generate plausible human gras** motions suitable for training the robot. To achieve this, we propose a hand-object synthesis method that is designed to generate handover-friendly motions similar to humans. This allows us to generate synthetic training and testing data with 100x more objects than previous work. In our experiments, we show that our method trained purely with synthetic data is competitive with state-of-the-art methods that rely on real human motion data both in simulation and on a real system. In addition, we can perform evaluations on a larger scale compared to prior work. With our newly introduced test set, we show that our model can better scale to a large variety of unseen objects and human motions compared to the baselines. Project page: https://eth-ait.github.io/synthetic-handovers/ △ Less

Submitted 9 November, 2023; originally announced November 2023.

arXiv:2310.17519 [pdf, other]

doi 10.1145/3618401

FLARE: Fast Learning of Animatable and Relightable Mesh Avatars

Authors: Shrisha Bharadwaj, Yufeng Zheng, Otmar Hilliges, Michael J. Black, Victoria Fernandez-Abrevaya

Abstract: Our goal is to efficiently learn personalized animatable 3D head avatars from videos that are geometrically accurate, realistic, relightable, and compatible with current rendering systems. While 3D meshes enable efficient processing and are highly portable, they lack realism in terms of shape and appearance. Neural representations, on the other hand, are realistic but lack compatibility and are sl… ▽ More Our goal is to efficiently learn personalized animatable 3D head avatars from videos that are geometrically accurate, realistic, relightable, and compatible with current rendering systems. While 3D meshes enable efficient processing and are highly portable, they lack realism in terms of shape and appearance. Neural representations, on the other hand, are realistic but lack compatibility and are slow to train and render. Our key insight is that it is possible to efficiently learn high-fidelity 3D mesh representations via differentiable rendering by exploiting highly-optimized methods from traditional computer graphics and approximating some of the components with neural networks. To that end, we introduce FLARE, a technique that enables the creation of animatable and relightable mesh avatars from a single monocular video. First, we learn a canonical geometry using a mesh representation, enabling efficient differentiable rasterization and straightforward animation via learned blendshapes and linear blend skinning weights. Second, we follow physically-based rendering and factor observed colors into intrinsic albedo, roughness, and a neural representation of the illumination, allowing the learned avatars to be relit in novel scenes. Since our input videos are captured on a single device with a narrow field of view, modeling the surrounding environment light is non-trivial. Based on the split-sum approximation for modeling specular reflections, we address this by approximating the pre-filtered environment map with a multi-layer perceptron (MLP) modulated by the surface roughness, eliminating the need to explicitly model the light. We demonstrate that our mesh-based avatar formulation, combined with learned deformation, material, and lighting MLPs, produces avatars with high-quality geometry and appearance, while also being efficient to train and render compared to existing approaches. △ Less

Submitted 27 October, 2023; v1 submitted 26 October, 2023; originally announced October 2023.

Comments: 15 pages, Accepted: ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia), 2023

Journal ref: Volume 42, article number 204, year 2023

arXiv:2310.17347 [pdf, other]

CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling

Authors: Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, Romann M. Weber

Abstract: While conditional diffusion models are known to have good coverage of the data distribution, they still face limitations in output diversity, particularly when sampled with a high classifier-free guidance scale for optimal image quality or when trained on small datasets. We attribute this problem to the role of the conditioning signal in inference and offer an improved sampling strategy for diffus… ▽ More While conditional diffusion models are known to have good coverage of the data distribution, they still face limitations in output diversity, particularly when sampled with a high classifier-free guidance scale for optimal image quality or when trained on small datasets. We attribute this problem to the role of the conditioning signal in inference and offer an improved sampling strategy for diffusion models that can increase generation diversity, especially at high guidance scales, with minimal loss of sample quality. Our sampling strategy anneals the conditioning signal by adding scheduled, monotonically decreasing Gaussian noise to the conditioning vector during inference to balance diversity and condition alignment. Our Condition-Annealed Diffusion Sampler (CADS) can be used with any pretrained model and sampling algorithm, and we show that it boosts the diversity of diffusion models in various conditional generation tasks. Further, using an existing pretrained diffusion model, CADS achieves a new state-of-the-art FID of 1.70 and 2.31 for class-conditional ImageNet generation at 256$\times$256 and 512$\times$512 respectively. △ Less

Submitted 13 May, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

Comments: Published as a conference paper at ICLR 2024

Journal ref: The Twelfth International Conference on Learning Representations (ICLR 2024)

arXiv:2310.13768 [pdf, other]

PACE: Human and Camera Motion Estimation from in-the-wild Videos

Authors: Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J. Black, Otmar Hilliges, Jan Kautz, Umar Iqbal

Abstract: We present a method to estimate human motion in a global scene from moving cameras. This is a highly challenging task due to the coupling of human and camera motions in the video. To address this problem, we propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features. Unlike existing methods that use SLAM… ▽ More We present a method to estimate human motion in a global scene from moving cameras. This is a highly challenging task due to the coupling of human and camera motions in the video. To address this problem, we propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features. Unlike existing methods that use SLAM as initialization, we propose to tightly integrate SLAM and human motion priors in an optimization that is inspired by bundle adjustment. Specifically, we optimize human and camera motions to match both the observed human pose and scene features. This design combines the strengths of SLAM and motion priors, which leads to significant improvements in human and camera motion estimation. We additionally introduce a motion prior that is suitable for batch optimization, making our approach significantly more efficient than existing approaches. Finally, we propose a novel synthetic dataset that enables evaluating camera motion in addition to human motion from dynamic videos. Experiments on the synthetic and real-world RICH datasets demonstrate that our approach substantially outperforms prior art in recovering both human and camera motions. △ Less

Submitted 20 October, 2023; originally announced October 2023.

Comments: 3DV 2024. Project page: https://nvlabs.github.io/PACE/

arXiv:2309.16859 [pdf, other]

Preface: A Data-driven Volumetric Prior for Few-shot Ultra High-resolution Face Synthesis

Authors: Marcel C. Bühler, Kripasindhu Sarkar, Tanmay Shah, Gengyan Li, Daoye Wang, Leonhard Helminger, Sergio Orts-Escolano, Dmitry Lagun, Otmar Hilliges, Thabo Beeler, Abhimitra Meka

Abstract: NeRFs have enabled highly realistic synthesis of human faces including complex appearance and reflectance effects of hair and skin. These methods typically require a large number of multi-view input images, making the process hardware intensive and cumbersome, limiting applicability to unconstrained settings. We propose a novel volumetric human face prior that enables the synthesis of ultra high-r… ▽ More NeRFs have enabled highly realistic synthesis of human faces including complex appearance and reflectance effects of hair and skin. These methods typically require a large number of multi-view input images, making the process hardware intensive and cumbersome, limiting applicability to unconstrained settings. We propose a novel volumetric human face prior that enables the synthesis of ultra high-resolution novel views of subjects that are not part of the prior's training distribution. This prior model consists of an identity-conditioned NeRF, trained on a dataset of low-resolution multi-view images of diverse humans with known camera calibration. A simple sparse landmark-based 3D alignment of the training dataset allows our model to learn a smooth latent space of geometry and appearance despite a limited number of training identities. A high-quality volumetric representation of a novel subject can be obtained by model fitting to 2 or 3 camera views of arbitrary resolution. Importantly, our method requires as few as two views of casually captured images as input at inference time. △ Less

Submitted 28 September, 2023; originally announced September 2023.

Comments: In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

arXiv:2309.07907 [pdf, other]

Physically Plausible Full-Body Hand-Object Interaction Synthesis

Authors: Jona Braun, Sammy Christen, Muhammed Kocabas, Emre Aksan, Otmar Hilliges

Abstract: We propose a physics-based method for synthesizing dexterous hand-object interactions in a full-body setting. While recent advancements have addressed specific facets of human-object interactions, a comprehensive physics-based approach remains a challenge. Existing methods often focus on isolated segments of the interaction process and rely on data-driven techniques that may result in artifacts. I… ▽ More We propose a physics-based method for synthesizing dexterous hand-object interactions in a full-body setting. While recent advancements have addressed specific facets of human-object interactions, a comprehensive physics-based approach remains a challenge. Existing methods often focus on isolated segments of the interaction process and rely on data-driven techniques that may result in artifacts. In contrast, our proposed method embraces reinforcement learning (RL) and physics simulation to mitigate the limitations of data-driven approaches. Through a hierarchical framework, we first learn skill priors for both body and hand movements in a decoupled setting. The generic skill priors learn to decode a latent skill embedding into the motion of the underlying part. A high-level policy then controls hand-object interactions in these pretrained latent spaces, guided by task objectives of gras** and 3D target trajectory following. It is trained using a novel reward function that combines an adversarial style term with a task reward, encouraging natural motions while fulfilling the task incentives. Our method successfully accomplishes the complete interaction task, from approaching an object to gras** and subsequent manipulation. We compare our approach against kinematics-based baselines and show that it leads to more physically plausible motions. △ Less

Submitted 14 September, 2023; originally announced September 2023.

Comments: Project page at https://eth-ait.github.io/phys-fullbody-grasp

arXiv:2309.03891 [pdf, other]

ArtiGrasp: Physically Plausible Synthesis of Bi-Manual Dexterous Gras** and Articulation

Authors: Hui Zhang, Sammy Christen, Zicong Fan, Luocheng Zheng, Jemin Hwangbo, Jie Song, Otmar Hilliges

Abstract: We present ArtiGrasp, a novel method to synthesize bi-manual hand-object interactions that include gras** and articulation. This task is challenging due to the diversity of the global wrist motions and the precise finger control that are necessary to articulate objects. ArtiGrasp leverages reinforcement learning and physics simulations to train a policy that controls the global and local hand po… ▽ More We present ArtiGrasp, a novel method to synthesize bi-manual hand-object interactions that include gras** and articulation. This task is challenging due to the diversity of the global wrist motions and the precise finger control that are necessary to articulate objects. ArtiGrasp leverages reinforcement learning and physics simulations to train a policy that controls the global and local hand pose. Our framework unifies gras** and articulation within a single policy guided by a single hand pose reference. Moreover, to facilitate the training of the precise finger control required for articulation, we present a learning curriculum with increasing difficulty. It starts with single-hand manipulation of stationary objects and continues with multi-agent training including both hands and non-stationary objects. To evaluate our method, we introduce Dynamic Object Gras** and Articulation, a task that involves bringing an object into a target articulated pose. This task requires gras**, relocation, and articulation. We show our method's efficacy towards this task. We further demonstrate that our method can generate motions with noisy hand-object pose estimates from an off-the-shelf image-based regressor. △ Less

Submitted 3 March, 2024; v1 submitted 7 September, 2023; originally announced September 2023.

Comments: 3DV-2024 camera ready. Project page: https://eth-ait.github.io/artigrasp/

arXiv:2308.16894 [pdf, other]

EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild

Authors: Manuel Kaufmann, Jie Song, Chen Guo, Kaiyue Shen, Tianjian Jiang, Chengcheng Tang, Juan Zarate, Otmar Hilliges

Abstract: We present EMDB, the Electromagnetic Database of Global 3D Human Pose and Shape in the Wild. EMDB is a novel dataset that contains high-quality 3D SMPL pose and shape parameters with global body and camera trajectories for in-the-wild videos. We use body-worn, wireless electromagnetic (EM) sensors and a hand-held iPhone to record a total of 58 minutes of motion data, distributed over 81 indoor and… ▽ More We present EMDB, the Electromagnetic Database of Global 3D Human Pose and Shape in the Wild. EMDB is a novel dataset that contains high-quality 3D SMPL pose and shape parameters with global body and camera trajectories for in-the-wild videos. We use body-worn, wireless electromagnetic (EM) sensors and a hand-held iPhone to record a total of 58 minutes of motion data, distributed over 81 indoor and outdoor sequences and 10 participants. Together with accurate body poses and shapes, we also provide global camera poses and body root trajectories. To construct EMDB, we propose a multi-stage optimization procedure, which first fits SMPL to the 6-DoF EM measurements and then refines the poses via image observations. To achieve high-quality results, we leverage a neural implicit avatar model to reconstruct detailed human surface geometry and appearance, which allows for improved alignment and smoothness via a dense pixel-level objective. Our evaluations, conducted with a multi-view volumetric capture system, indicate that EMDB has an expected accuracy of 2.3 cm positional and 10.6 degrees angular error, surpassing the accuracy of previous in-the-wild datasets. We evaluate existing state-of-the-art monocular RGB methods for camera-relative and global pose estimation on EMDB. EMDB is publicly available under https://ait.ethz.ch/emdb △ Less

Submitted 31 August, 2023; originally announced August 2023.

Comments: Accepted to ICCV 2023

arXiv:2306.16545 [pdf, other]

Palm: Predicting Actions through Language Models @ Ego4D Long-Term Action Anticipation Challenge 2023

Authors: Daoji Huang, Otmar Hilliges, Luc Van Gool, Xi Wang

Abstract: We present Palm, a solution to the Long-Term Action Anticipation (LTA) task utilizing vision-language and large language models. Given an input video with annotated action periods, the LTA task aims to predict possible future actions. We hypothesize that an optimal solution should capture the interdependency between past and future actions, and be able to infer future actions based on the structur… ▽ More We present Palm, a solution to the Long-Term Action Anticipation (LTA) task utilizing vision-language and large language models. Given an input video with annotated action periods, the LTA task aims to predict possible future actions. We hypothesize that an optimal solution should capture the interdependency between past and future actions, and be able to infer future actions based on the structure and dependency encoded in the past actions. Large language models have demonstrated remarkable commonsense-based reasoning ability. Inspired by that, Palm chains an image captioning model and a large language model. It predicts future actions based on frame descriptions and action labels extracted from the input videos. Our method outperforms other participants in the EGO4D LTA challenge and achieves the best performance in terms of action prediction. Our code is available at https://github.com/DanDoge/Palm △ Less

Submitted 28 June, 2023; originally announced June 2023.

arXiv:2305.05526 [pdf, other]

EFE: End-to-end Frame-to-Gaze Estimation

Authors: Haldun Balim, Seonwook Park, Xi Wang, Xucong Zhang, Otmar Hilliges

Abstract: Despite the recent development of learning-based gaze estimation methods, most methods require one or more eye or face region crops as inputs and produce a gaze direction vector as output. Crop** results in a higher resolution in the eye regions and having fewer confounding factors (such as clothing and hair) is believed to benefit the final model performance. However, this eye/face patch croppi… ▽ More Despite the recent development of learning-based gaze estimation methods, most methods require one or more eye or face region crops as inputs and produce a gaze direction vector as output. Crop** results in a higher resolution in the eye regions and having fewer confounding factors (such as clothing and hair) is believed to benefit the final model performance. However, this eye/face patch crop** process is expensive, erroneous, and implementation-specific for different methods. In this paper, we propose a frame-to-gaze network that directly predicts both 3D gaze origin and 3D gaze direction from the raw frame out of the camera without any face or eye crop**. Our method demonstrates that direct gaze regression from the raw downscaled frame, from FHD/HD to VGA/HVGA resolution, is possible despite the challenges of having very few pixels in the eye region. The proposed method achieves comparable results to state-of-the-art methods in Point-of-Gaze (PoG) estimation on three public gaze datasets: GazeCapture, MPIIFaceGaze, and EVE, and generalizes well to extreme camera view changes. △ Less

Submitted 9 May, 2023; originally announced May 2023.

arXiv:2305.02312 [pdf, other]

AG3D: Learning to Generate 3D Avatars from 2D Image Collections

Authors: Zijian Dong, Xu Chen, **long Yang, Michael J. Black, Otmar Hilliges, Andreas Geiger

Abstract: While progress in 2D generative models of human appearance has been rapid, many applications require 3D avatars that can be animated and rendered. Unfortunately, most existing methods for learning generative models of 3D humans with diverse shape and appearance require 3D training data, which is limited and expensive to acquire. The key to progress is hence to learn generative models of 3D avatars… ▽ More While progress in 2D generative models of human appearance has been rapid, many applications require 3D avatars that can be animated and rendered. Unfortunately, most existing methods for learning generative models of 3D humans with diverse shape and appearance require 3D training data, which is limited and expensive to acquire. The key to progress is hence to learn generative models of 3D avatars from abundant unstructured 2D image collections. However, learning realistic and complete 3D appearance and geometry in this under-constrained setting remains challenging, especially in the presence of loose clothing such as dresses. In this paper, we propose a new adversarial generative model of realistic 3D people from 2D images. Our method captures shape and deformation of the body and loose clothing by adopting a holistic 3D generator and integrating an efficient and flexible articulation module. To improve realism, we train our model using multiple discriminators while also integrating geometric cues in the form of predicted 2D normal maps. We experimentally find that our method outperforms previous 3D- and articulation-aware methods in terms of geometry and appearance. We validate the effectiveness of our model and the importance of each component via systematic ablation studies. △ Less

Submitted 3 May, 2023; originally announced May 2023.

Comments: Project Page: https://zj-dong.github.io/AG3D/

arXiv:2305.00121 [pdf, other]

Learning Locally Editable Virtual Humans

Authors: Hsuan-I Ho, Lixin Xue, Jie Song, Otmar Hilliges

Abstract: In this paper, we propose a novel hybrid representation and end-to-end trainable network architecture to model fully editable and customizable neural avatars. At the core of our work lies a representation that combines the modeling power of neural fields with the ease of use and inherent 3D consistency of skinned meshes. To this end, we construct a trainable feature codebook to store local geometr… ▽ More In this paper, we propose a novel hybrid representation and end-to-end trainable network architecture to model fully editable and customizable neural avatars. At the core of our work lies a representation that combines the modeling power of neural fields with the ease of use and inherent 3D consistency of skinned meshes. To this end, we construct a trainable feature codebook to store local geometry and texture features on the vertices of a deformable body model, thus exploiting its consistent topology under articulation. This representation is then employed in a generative auto-decoder architecture that admits fitting to unseen scans and sampling of realistic avatars with varied appearances and geometries. Furthermore, our representation allows local editing by swap** local features between 3D assets. To verify our method for avatar creation and editing, we contribute a new high-quality dataset, dubbed CustomHumans, for training and evaluation. Our experiments quantitatively and qualitatively show that our method generates diverse detailed avatars and achieves better model fitting performance compared to state-of-the-art methods. Our code and dataset are available at https://custom-humans.github.io/. △ Less

Submitted 28 April, 2023; originally announced May 2023.

Comments: 12+11 pages, CVPR'23, project page https://custom-humans.github.io/

arXiv:2303.17592 [pdf, other]

Learning Human-to-Robot Handovers from Point Clouds

Authors: Sammy Christen, Wei Yang, Claudia Pérez-D'Arpino, Otmar Hilliges, Dieter Fox, Yu-Wei Chao

Abstract: We propose the first framework to learn control policies for vision-based human-to-robot handovers, a critical task for human-robot interaction. While research in Embodied AI has made significant progress in training robot agents in simulated environments, interacting with humans remains challenging due to the difficulties of simulating humans. Fortunately, recent research has developed realistic… ▽ More We propose the first framework to learn control policies for vision-based human-to-robot handovers, a critical task for human-robot interaction. While research in Embodied AI has made significant progress in training robot agents in simulated environments, interacting with humans remains challenging due to the difficulties of simulating humans. Fortunately, recent research has developed realistic simulated environments for human-to-robot handovers. Leveraging this result, we introduce a method that is trained with a human-in-the-loop via a two-stage teacher-student framework that uses motion and grasp planning, reinforcement learning, and self-supervision. We show significant performance gains over baselines on a simulation benchmark, sim-to-sim transfer and sim-to-real transfer. △ Less

Submitted 30 March, 2023; originally announced March 2023.

Comments: Accepted at CVPR 2023 as highlight. Project page at https://handover-sim2real.github.io

arXiv:2303.17209 [pdf, other]

Human from Blur: Human Pose Tracking from Blurry Images

Authors: Yiming Zhao, Denys Rozumnyi, Jie Song, Otmar Hilliges, Marc Pollefeys, Martin R. Oswald

Abstract: We propose a method to estimate 3D human poses from substantially blurred images. The key idea is to tackle the inverse problem of image deblurring by modeling the forward problem with a 3D human model, a texture map, and a sequence of poses to describe human motion. The blurring process is then modeled by a temporal image aggregation step. Using a differentiable renderer, we can solve the inverse… ▽ More We propose a method to estimate 3D human poses from substantially blurred images. The key idea is to tackle the inverse problem of image deblurring by modeling the forward problem with a 3D human model, a texture map, and a sequence of poses to describe human motion. The blurring process is then modeled by a temporal image aggregation step. Using a differentiable renderer, we can solve the inverse problem by backpropagating the pixel-wise reprojection error to recover the best human motion representation that explains a single or multiple input images. Since the image reconstruction loss alone is insufficient, we present additional regularization terms. To the best of our knowledge, we present the first method to tackle this problem. Our method consistently outperforms other methods on significantly blurry inputs since they lack one or multiple key functionalities that our method unifies, i.e. image deblurring with sub-frame accuracy and explicit 3D modeling of non-rigid human motion. △ Less

Submitted 25 September, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

Comments: typos and minor error fixed

arXiv:2303.15380 [pdf, other]

Hi4D: 4D Instance Segmentation of Close Human Interaction

Authors: Yifei Yin, Chen Guo, Manuel Kaufmann, Juan Jose Zarate, Jie Song, Otmar Hilliges

Abstract: We propose Hi4D, a method and dataset for the automatic analysis of physically close human-human interaction under prolonged contact. Robustly disentangling several in-contact subjects is a challenging task due to occlusions and complex shapes. Hence, existing multi-view systems typically fuse 3D surfaces of close subjects into a single, connected mesh. To address this issue we leverage i) individ… ▽ More We propose Hi4D, a method and dataset for the automatic analysis of physically close human-human interaction under prolonged contact. Robustly disentangling several in-contact subjects is a challenging task due to occlusions and complex shapes. Hence, existing multi-view systems typically fuse 3D surfaces of close subjects into a single, connected mesh. To address this issue we leverage i) individually fitted neural implicit avatars; ii) an alternating optimization scheme that refines pose and surface through periods of close proximity; and iii) thus segment the fused raw scans into individual instances. From these instances we compile Hi4D dataset of 4D textured scans of 20 subject pairs, 100 sequences, and a total of more than 11K frames. Hi4D contains rich interaction-centric annotations in 2D and 3D alongside accurately registered parametric body models. We define varied human pose and shape estimation tasks on this dataset and provide results from state-of-the-art methods on these benchmarks. △ Less

Submitted 27 March, 2023; originally announced March 2023.

Comments: Project page: https://yifeiyin04.github.io/Hi4D/

arXiv:2303.09628 [pdf, other]

Efficient Learning of High Level Plans from Play

Authors: Núria Armengol Urpí, Marco Bagatella, Otmar Hilliges, Georg Martius, Stelian Coros

Abstract: Real-world robotic manipulation tasks remain an elusive challenge, since they involve both fine-grained environment interaction, as well as the ability to plan for long-horizon goals. Although deep reinforcement learning (RL) methods have shown encouraging results when planning end-to-end in high-dimensional environments, they remain fundamentally limited by poor sample efficiency due to inefficie… ▽ More Real-world robotic manipulation tasks remain an elusive challenge, since they involve both fine-grained environment interaction, as well as the ability to plan for long-horizon goals. Although deep reinforcement learning (RL) methods have shown encouraging results when planning end-to-end in high-dimensional environments, they remain fundamentally limited by poor sample efficiency due to inefficient exploration, and by the complexity of credit assignment over long horizons. In this work, we present Efficient Learning of High-Level Plans from Play (ELF-P), a framework for robotic learning that bridges motion planning and deep RL to achieve long-horizon complex manipulation tasks. We leverage task-agnostic play data to learn a discrete behavioral prior over object-centric primitives, modeling their feasibility given the current context. We then design a high-level goal-conditioned policy which (1) uses primitives as building blocks to scaffold complex long-horizon tasks and (2) leverages the behavioral prior to accelerate learning. We demonstrate that ELF-P has significantly better sample efficiency than relevant baselines over multiple realistic manipulation tasks and learns policies that can be easily transferred to physical hardware. △ Less

Submitted 16 March, 2023; originally announced March 2023.

Comments: Accepted to the International Conference on Robotics and Automation 2023

arXiv:2303.04805 [pdf, other]

X-Avatar: Expressive Human Avatars

Authors: Kaiyue Shen, Chen Guo, Manuel Kaufmann, Juan Jose Zarate, Julien Valentin, Jie Song, Otmar Hilliges

Abstract: We present X-Avatar, a novel avatar model that captures the full expressiveness of digital humans to bring about life-like experiences in telepresence, AR/VR and beyond. Our method models bodies, hands, facial expressions and appearance in a holistic fashion and can be learned from either full 3D scans or RGB-D data. To achieve this, we propose a part-aware learned forward skinning module that can… ▽ More We present X-Avatar, a novel avatar model that captures the full expressiveness of digital humans to bring about life-like experiences in telepresence, AR/VR and beyond. Our method models bodies, hands, facial expressions and appearance in a holistic fashion and can be learned from either full 3D scans or RGB-D data. To achieve this, we propose a part-aware learned forward skinning module that can be driven by the parameter space of SMPL-X, allowing for expressive animation of X-Avatars. To efficiently learn the neural shape and deformation fields, we propose novel part-aware sampling and initialization strategies. This leads to higher fidelity results, especially for smaller body parts while maintaining efficient training despite increased number of articulated bones. To capture the appearance of the avatar with high-frequency details, we extend the geometry and deformation fields with a texture network that is conditioned on pose, facial expression, geometry and the normals of the deformed surface. We show experimentally that our method outperforms strong baselines in both data domains both quantitatively and qualitatively on the animation task. To facilitate future research on expressive avatars we contribute a new dataset, called X-Humans, containing 233 sequences of high-quality textured scans from 20 participants, totalling 35,500 data frames. △ Less

Submitted 9 March, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

Comments: Project page: https://skype-line.github.io/projects/X-Avatar/

arXiv:2302.11566 [pdf, other]

Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition

Authors: Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, Otmar Hilliges

Abstract: We present Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos. Reconstructing humans that move naturally from monocular in-the-wild videos is difficult. Solving it requires accurately separating humans from arbitrary backgrounds. Moreover, it requires reconstructing detailed 3D surface from short video sequences, making it even more challenging. Despite these challenges,… ▽ More We present Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos. Reconstructing humans that move naturally from monocular in-the-wild videos is difficult. Solving it requires accurately separating humans from arbitrary backgrounds. Moreover, it requires reconstructing detailed 3D surface from short video sequences, making it even more challenging. Despite these challenges, our method does not require any groundtruth supervision or priors extracted from large datasets of clothed human scans, nor do we rely on any external segmentation modules. Instead, it solves the tasks of scene decomposition and surface reconstruction directly in 3D by modeling both the human and the background in the scene jointly, parameterized via two separate neural fields. Specifically, we define a temporally consistent human representation in canonical space and formulate a global optimization over the background model, the canonical human shape and texture, and per-frame human pose parameters. A coarse-to-fine sampling strategy for volume rendering and novel objectives are introduced for a clean separation of dynamic human and static background, yielding detailed and robust 3D human geometry reconstructions. We evaluate our methods on publicly available datasets and show improvements over prior art. △ Less

Submitted 22 February, 2023; originally announced February 2023.

Comments: Project page: https://moygcc.github.io/vid2avatar/

arXiv:2301.09209 [pdf, other]

Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

Authors: Razvan-George Pasca, Alexey Gavryushin, Muhammad Hamza, Yen-Ling Kuo, Kaichun Mo, Luc Van Gool, Otmar Hilliges, Xi Wang

Abstract: We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatio-temporal context formed by past actions on objects, coined action context. We propose TransFusion, a multimodal transformer-based architecture. It exploits the representational power of language by summarizing the action context. TransFusion leverages pre-trained image captioning and vi… ▽ More We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatio-temporal context formed by past actions on objects, coined action context. We propose TransFusion, a multimodal transformer-based architecture. It exploits the representational power of language by summarizing the action context. TransFusion leverages pre-trained image captioning and vision-language models to extract the action context from past video frames. This action context together with the next video frame is processed by the multimodal fusion module to forecast the next object interaction. Our model enables more efficient end-to-end learning. The large pre-trained language models add common sense and a generalisation capability. Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our multimodal fusion model. They also highlight the benefits of using language-based context summaries in a task where vision seems to suffice. Our method outperforms state-of-the-art approaches by 40.4% in relative terms in overall mAP on the Ego4D test set. We validate the effectiveness of TransFusion via experiments on EPIC-KITCHENS-100. Video and code are available at https://eth-ait.github.io/transfusion-proj/. △ Less

Submitted 10 March, 2024; v1 submitted 22 January, 2023; originally announced January 2023.

arXiv:2212.10550 [pdf, other]

InstantAvatar: Learning Avatars from Monocular Video in 60 Seconds

Authors: Tianjian Jiang, Xu Chen, Jie Song, Otmar Hilliges

Abstract: In this paper, we take a significant step towards real-world applicability of monocular neural avatar reconstruction by contributing InstantAvatar, a system that can reconstruct human avatars from a monocular video within seconds, and these avatars can be animated and rendered at an interactive rate. To achieve this efficiency we propose a carefully designed and engineered system, that leverages e… ▽ More In this paper, we take a significant step towards real-world applicability of monocular neural avatar reconstruction by contributing InstantAvatar, a system that can reconstruct human avatars from a monocular video within seconds, and these avatars can be animated and rendered at an interactive rate. To achieve this efficiency we propose a carefully designed and engineered system, that leverages emerging acceleration structures for neural fields, in combination with an efficient empty space-skip** strategy for dynamic scenes. We also contribute an efficient implementation that we will make available for research purposes. Compared to existing methods, InstantAvatar converges 130x faster and can be trained in minutes instead of hours. It achieves comparable or even better reconstruction quality and novel pose synthesis results. When given the same time budget, our method significantly outperforms SoTA methods. InstantAvatar can yield acceptable visual quality in as little as 10 seconds training time. △ Less

Submitted 20 December, 2022; originally announced December 2022.

Comments: 12 pages

arXiv:2212.09530 [pdf, other]

HARP: Personalized Hand Reconstruction from a Monocular RGB Video

Authors: Korrawe Karunratanakul, Sergey Prokudin, Otmar Hilliges, Siyu Tang

Abstract: We present HARP (HAnd Reconstruction and Personalization), a personalized hand avatar creation approach that takes a short monocular RGB video of a human hand as input and reconstructs a faithful hand avatar exhibiting a high-fidelity appearance and geometry. In contrast to the major trend of neural implicit representations, HARP models a hand with a mesh-based parametric hand model, a vertex disp… ▽ More We present HARP (HAnd Reconstruction and Personalization), a personalized hand avatar creation approach that takes a short monocular RGB video of a human hand as input and reconstructs a faithful hand avatar exhibiting a high-fidelity appearance and geometry. In contrast to the major trend of neural implicit representations, HARP models a hand with a mesh-based parametric hand model, a vertex displacement map, a normal map, and an albedo without any neural components. As validated by our experiments, the explicit nature of our representation enables a truly scalable, robust, and efficient approach to hand avatar creation. HARP is optimized via gradient descent from a short sequence captured by a hand-held mobile phone and can be directly used in AR/VR applications with real-time rendering capability. To enable this, we carefully design and implement a shadow-aware differentiable rendering scheme that is robust to high degree articulations and self-shadowing regularly present in hand motion sequences, as well as challenging lighting conditions. It also generalizes to unseen poses and novel viewpoints, producing photo-realistic renderings of hand animations performing highly-articulated motions. Furthermore, the learned HARP representation can be used for improving 3D hand pose estimation quality in challenging viewpoints. The key advantages of HARP are validated by the in-depth analyses on appearance reconstruction, novel-view and novel pose synthesis, and 3D hand pose refinement. It is an AR/VR-ready personalized hand representation that shows superior fidelity and scalability. △ Less

Submitted 3 July, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

Comments: CVPR 2023. Project page: https://korrawe.github.io/harp-project/

arXiv:2212.08377 [pdf, other]

PointAvatar: Deformable Point-based Head Avatars from Videos

Authors: Yufeng Zheng, Wang Yifan, Gordon Wetzstein, Michael J. Black, Otmar Hilliges

Abstract: The ability to create realistic, animatable and relightable head avatars from casual video sequences would open up wide ranging applications in communication and entertainment. Current methods either build on explicit 3D morphable meshes (3DMM) or exploit neural implicit representations. The former are limited by fixed topology, while the latter are non-trivial to deform and inefficient to render.… ▽ More The ability to create realistic, animatable and relightable head avatars from casual video sequences would open up wide ranging applications in communication and entertainment. Current methods either build on explicit 3D morphable meshes (3DMM) or exploit neural implicit representations. The former are limited by fixed topology, while the latter are non-trivial to deform and inefficient to render. Furthermore, existing approaches entangle lighting in the color estimation, thus they are limited in re-rendering the avatar in new environments. In contrast, we propose PointAvatar, a deformable point-based representation that disentangles the source color into intrinsic albedo and normal-dependent shading. We demonstrate that PointAvatar bridges the gap between existing mesh- and implicit representations, combining high-quality geometry and appearance with topological flexibility, ease of deformation and rendering efficiency. We show that our method is able to generate animatable 3D avatars using monocular videos from multiple sources including hand-held smartphones, laptop webcams and internet videos, achieving state-of-the-art quality in challenging cases where previous methods fail, e.g., thin hair strands, while being significantly more efficient in training than competing methods. △ Less

Submitted 28 February, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

Comments: Project page: https://zhengyuf.github.io/PointAvatar/ Code base: https://github.com/zhengyuf/pointavatar

arXiv:2212.07242 [pdf, other]

HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics

Authors: Artur Grigorev, Bernhard Thomaszewski, Michael J. Black, Otmar Hilliges

Abstract: We propose a method that leverages graph neural networks, multi-level message passing, and unsupervised training to enable real-time prediction of realistic clothing dynamics. Whereas existing methods based on linear blend skinning must be trained for specific garments, our method is agnostic to body shape and applies to tight-fitting garments as well as loose, free-flowing clothing. Our method fu… ▽ More We propose a method that leverages graph neural networks, multi-level message passing, and unsupervised training to enable real-time prediction of realistic clothing dynamics. Whereas existing methods based on linear blend skinning must be trained for specific garments, our method is agnostic to body shape and applies to tight-fitting garments as well as loose, free-flowing clothing. Our method furthermore handles changes in topology (e.g., garments with buttons or zippers) and material properties at inference time. As one key contribution, we propose a hierarchical message-passing scheme that efficiently propagates stiff stretching modes while preserving local detail. We empirically show that our method outperforms strong baselines quantitatively and that its results are perceived as more realistic than state-of-the-art methods. △ Less

Submitted 16 June, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 16965-16974

arXiv:2212.04823 [pdf, other]

GazeNeRF: 3D-Aware Gaze Redirection with Neural Radiance Fields

Authors: Alessandro Ruzzi, Xiangwei Shi, Xi Wang, Gengyan Li, Shalini De Mello, Hyung ** Chang, Xucong Zhang, Otmar Hilliges

Abstract: We propose GazeNeRF, a 3D-aware method for the task of gaze redirection. Existing gaze redirection methods operate on 2D images and struggle to generate 3D consistent results. Instead, we build on the intuition that the face region and eyeballs are separate 3D structures that move in a coordinated yet independent fashion. Our method leverages recent advancements in conditional image-based neural r… ▽ More We propose GazeNeRF, a 3D-aware method for the task of gaze redirection. Existing gaze redirection methods operate on 2D images and struggle to generate 3D consistent results. Instead, we build on the intuition that the face region and eyeballs are separate 3D structures that move in a coordinated yet independent fashion. Our method leverages recent advancements in conditional image-based neural radiance fields and proposes a two-stream architecture that predicts volumetric features for the face and eye regions separately. Rigidly transforming the eye features via a 3D rotation matrix provides fine-grained control over the desired gaze angle. The final, redirected image is then attained via differentiable volume compositing. Our experiments show that this architecture outperforms naively conditioned NeRF baselines as well as previous state-of-the-art 2D gaze redirection methods in terms of redirection accuracy and identity preservation. △ Less

Submitted 28 March, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

Comments: Accepted at CVPR 2023. Github page: https://github.com/AlessandroRuzzi/GazeNeRF

arXiv:2211.16630 [pdf, other]

DINER: Depth-aware Image-based NEural Radiance fields

Authors: Malte Prinzler, Otmar Hilliges, Justus Thies

Abstract: We present Depth-aware Image-based NEural Radiance fields (DINER). Given a sparse set of RGB input views, we predict depth and feature maps to guide the reconstruction of a volumetric scene representation that allows us to render 3D objects under novel views. Specifically, we propose novel techniques to incorporate depth information into feature fusion and efficient scene sampling. In comparison t… ▽ More We present Depth-aware Image-based NEural Radiance fields (DINER). Given a sparse set of RGB input views, we predict depth and feature maps to guide the reconstruction of a volumetric scene representation that allows us to render 3D objects under novel views. Specifically, we propose novel techniques to incorporate depth information into feature fusion and efficient scene sampling. In comparison to the previous state of the art, DINER achieves higher synthesis quality and can process input views with greater disparity. This allows us to capture scenes more completely without changing capturing hardware requirements and ultimately enables larger viewpoint changes during novel view synthesis. We evaluate our method by synthesizing novel views, both for human heads and for general objects, and observe significantly improved qualitative results and increased perceptual metrics compared to the previous state of the art. The code is publicly available for research purposes. △ Less

Submitted 30 March, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: Website: https://malteprinzler.github.io/projects/diner/diner.html ; Video: https://www.youtube.com/watch?v=iI_fpjY5k8Y&t=1s

arXiv:2211.15601 [pdf, other]

Fast-SNARF: A Fast Deformer for Articulated Neural Fields

Authors: Xu Chen, Tianjian Jiang, Jie Song, Max Rietmann, Andreas Geiger, Michael J. Black, Otmar Hilliges

Abstract: Neural fields have revolutionized the area of 3D reconstruction and novel view synthesis of rigid scenes. A key challenge in making such methods applicable to articulated objects, such as the human body, is to model the deformation of 3D locations between the rest pose (a canonical space) and the deformed space. We propose a new articulation module for neural fields, Fast-SNARF, which finds accura… ▽ More Neural fields have revolutionized the area of 3D reconstruction and novel view synthesis of rigid scenes. A key challenge in making such methods applicable to articulated objects, such as the human body, is to model the deformation of 3D locations between the rest pose (a canonical space) and the deformed space. We propose a new articulation module for neural fields, Fast-SNARF, which finds accurate correspondences between canonical space and posed space via iterative root finding. Fast-SNARF is a drop-in replacement in functionality to our previous work, SNARF, while significantly improving its computational efficiency. We contribute several algorithmic and implementation improvements over SNARF, yielding a speed-up of $150\times$. These improvements include voxel-based correspondence search, pre-computing the linear blend skinning function, and an efficient software implementation with CUDA kernels. Fast-SNARF enables efficient and simultaneous optimization of shape and skinning weights given deformed observations without correspondences (e.g. 3D meshes). Because learning of deformation maps is a crucial component in many 3D human avatar methods and since Fast-SNARF provides a computationally efficient solution, we believe that this work represents a significant step towards the practical creation of 3D virtual humans. △ Less

Submitted 1 December, 2022; v1 submitted 28 November, 2022; originally announced November 2022.

Comments: github page: https://github.com/xuchen-ethz/fast-snarf

arXiv:2211.07556 [pdf, other]

Utilizing Synthetic Data in Supervised Learning for Robust 5-DoF Magnetic Marker Localization

Authors: Mengfan Wu, Thomas Langerak, Otmar Hilliges, Juan Zarate

Abstract: Tracking passive magnetic markers plays a vital role in advancing healthcare and robotics, offering the potential to significantly improve the precision and efficiency of systems. This technology is key to develo** smarter, more responsive tools and devices, such as enhanced surgical instruments, precise diagnostic tools, and robots with improved environmental interaction capabilities. However,… ▽ More Tracking passive magnetic markers plays a vital role in advancing healthcare and robotics, offering the potential to significantly improve the precision and efficiency of systems. This technology is key to develo** smarter, more responsive tools and devices, such as enhanced surgical instruments, precise diagnostic tools, and robots with improved environmental interaction capabilities. However, traditionally, the tracking of magnetic markers is computationally expensive due to the requirement for iterative optimization procedures. Moreover, these methods depend on the magnetic dipole model for their optimization function, which can yield imprecise outcomes due to the model's significant inaccuracies when dealing with short distances between non-spherical magnet and sensor.Our paper introduces a novel approach that leverages neural networks to bypass these limitations, directly inferring the marker's position and orientation to accurately determine the magnet's 5 DoF in a single step without initial estimation. Although our method demands an extensive supervised training phase, we mitigate this by introducing a computationally more efficient method to generate synthetic, yet realistic data using Finite Element Methods simulations. The benefits of fast and accurate inference significantly outweigh the offline training preparation. In our evaluation, we use different cylindrical magnets, tracked with a square array of 16 sensors. We perform the sensors' reading and position inference on a portable, neural networks-oriented single-board computer, ensuring a compact setup. We benchmark our prototype against vision-based ground truth data, achieving a mean positional error of 4 mm and an orientation error of 8 degrees within a 0.2x0.2x0.15 m working volume. These results showcase our prototype's ability to balance accuracy and compactness effectively in tracking 5 DoF. △ Less

Submitted 25 March, 2024; v1 submitted 14 November, 2022; originally announced November 2022.

arXiv:2210.07689 [pdf, other]

doi 10.1145/3526113.3545674

Computational Design of Active Kinesthetic Garments

Authors: Velko Vechev, Ronan Hinchet, Stelian Coros, Bernhard Thomaszewski, Otmar Hilliges

Abstract: Garments with the ability to provide kinesthetic force-feedback on-demand can augment human capabilities in a non-obtrusive way, enabling numerous applications in VR haptics, motion assistance, and robotic control. However, designing such garments is a complex, and often manual task, particularly when the goal is to resist multiple motions with a single design. In this work, we propose a computati… ▽ More Garments with the ability to provide kinesthetic force-feedback on-demand can augment human capabilities in a non-obtrusive way, enabling numerous applications in VR haptics, motion assistance, and robotic control. However, designing such garments is a complex, and often manual task, particularly when the goal is to resist multiple motions with a single design. In this work, we propose a computational pipeline for designing connecting structures between active components - one of the central challenges in this context. We focus on electrostatic (ES) clutches that are compliant in their passive state while strongly resisting elongation when activated. Our method automatically computes optimized connecting structures that efficiently resist a range of pre-defined body motions on demand. We propose a novel dual-objective optimization approach to simultaneously maximize the resistance to motion when clutches are active, while minimizing resistance when inactive. We demonstrate our method on a set of problems involving different body sites and a range of motions. We further fabricate and evaluate a subset of our automatically created designs against manually created baselines using mechanical testing and in a VR pointing study. △ Less

Submitted 14 October, 2022; originally announced October 2022.

arXiv:2209.12660 [pdf, other]

MARLUI: Multi-Agent Reinforcement Learning for Adaptive UIs

Authors: Thomas Langerak, Sammy Christen, Mert Albaba, Christoph Gebhardt, Otmar Hilliges

Abstract: Adaptive user interfaces (UIs) automatically change an interface to better support users' tasks. Recently, machine learning techniques have enabled the transition to more powerful and complex adaptive UIs. However, a core challenge for adaptive user interfaces is the reliance on high-quality user data that has to be collected offline for each task. We formulate UI adaptation as a multi-agent reinf… ▽ More Adaptive user interfaces (UIs) automatically change an interface to better support users' tasks. Recently, machine learning techniques have enabled the transition to more powerful and complex adaptive UIs. However, a core challenge for adaptive user interfaces is the reliance on high-quality user data that has to be collected offline for each task. We formulate UI adaptation as a multi-agent reinforcement learning problem to overcome this challenge. In our formulation, a user agent mimics a real user and learns to interact with a UI. Simultaneously, an interface agent learns UI adaptations to maximize the user agent's performance. The interface agent learns the task structure from the user agent's behavior and, based on that, can support the user agent in completing its task. Our method produces adaptation policies that are learned in simulation only and, therefore, does not need real user data. Our experiments show that learned policies generalize to real users and achieve on par performance with data-driven supervised learning baselines. △ Less

Submitted 27 October, 2023; v1 submitted 26 September, 2022; originally announced September 2022.

arXiv:2209.02485 [pdf, other]

Reconstructing Action-Conditioned Human-Object Interactions Using Commonsense Knowledge Priors

Authors: Xi Wang, Gen Li, Yen-Ling Kuo, Muhammed Kocabas, Emre Aksan, Otmar Hilliges

Abstract: We present a method for inferring diverse 3D models of human-object interactions from images. Reasoning about how humans interact with objects in complex scenes from a single 2D image is a challenging task given ambiguities arising from the loss of information through projection. In addition, modeling 3D interactions requires the generalization ability towards diverse object categories and interac… ▽ More We present a method for inferring diverse 3D models of human-object interactions from images. Reasoning about how humans interact with objects in complex scenes from a single 2D image is a challenging task given ambiguities arising from the loss of information through projection. In addition, modeling 3D interactions requires the generalization ability towards diverse object categories and interaction types. We propose an action-conditioned modeling of interactions that allows us to infer diverse 3D arrangements of humans and objects without supervision on contact regions or 3D scene geometry. Our method extracts high-level commonsense knowledge from large language models (such as GPT-3), and applies them to perform 3D reasoning of human-object interactions. Our key insight is priors extracted from large language models can help in reasoning about human-object contacts from textural prompts only. We quantitatively evaluate the inferred 3D models on a large human-object interaction dataset and show how our method leads to better 3D reconstructions. We further qualitatively evaluate the effectiveness of our method on real images and demonstrate its generalizability towards interaction types and object categories. △ Less

Submitted 6 September, 2022; originally announced September 2022.

arXiv:2209.00489 [pdf, other]

TempCLR: Reconstructing Hands via Time-Coherent Contrastive Learning

Authors: Andrea Ziani, Zicong Fan, Muhammed Kocabas, Sammy Christen, Otmar Hilliges

Abstract: We introduce TempCLR, a new time-coherent contrastive learning approach for the structured regression task of 3D hand reconstruction. Unlike previous time-contrastive methods for hand pose estimation, our framework considers temporal consistency in its augmentation scheme, and accounts for the differences of hand poses along the temporal direction. Our data-driven method leverages unlabelled video… ▽ More We introduce TempCLR, a new time-coherent contrastive learning approach for the structured regression task of 3D hand reconstruction. Unlike previous time-contrastive methods for hand pose estimation, our framework considers temporal consistency in its augmentation scheme, and accounts for the differences of hand poses along the temporal direction. Our data-driven method leverages unlabelled videos and a standard CNN, without relying on synthetic data, pseudo-labels, or specialized architectures. Our approach improves the performance of fully-supervised hand reconstruction methods by 15.9% and 7.6% in PA-V2V on the HO-3D and FreiHAND datasets respectively, thus establishing new state-of-the-art performance. Finally, we demonstrate that our approach produces smoother hand reconstructions through time, and is more robust to heavy occlusions compared to the previous state-of-the-art which we show quantitatively and qualitatively. Our code and models will be available at https://eth-ait.github.io/tempclr. △ Less

Submitted 1 September, 2022; originally announced September 2022.

arXiv:2206.08428 [pdf, other]

doi 10.1145/3528223.3530130

EyeNeRF: A Hybrid Representation for Photorealistic Synthesis, Animation and Relighting of Human Eyes

Authors: Gengyan Li, Abhimitra Meka, Franziska Müller, Marcel C. Bühler, Otmar Hilliges, Thabo Beeler

Abstract: A unique challenge in creating high-quality animatable and relightable 3D avatars of people is modeling human eyes. The challenge of synthesizing eyes is multifold as it requires 1) appropriate representations for the various components of the eye and the periocular region for coherent viewpoint synthesis, capable of representing diffuse, refractive and highly reflective surfaces, 2) disentangling… ▽ More A unique challenge in creating high-quality animatable and relightable 3D avatars of people is modeling human eyes. The challenge of synthesizing eyes is multifold as it requires 1) appropriate representations for the various components of the eye and the periocular region for coherent viewpoint synthesis, capable of representing diffuse, refractive and highly reflective surfaces, 2) disentangling skin and eye appearance from environmental illumination such that it may be rendered under novel lighting conditions, and 3) capturing eyeball motion and the deformation of the surrounding skin to enable re-gazing. These challenges have traditionally necessitated the use of expensive and cumbersome capture setups to obtain high-quality results, and even then, modeling of the eye region holistically has remained elusive. We present a novel geometry and appearance representation that enables high-fidelity capture and photorealistic animation, view synthesis and relighting of the eye region using only a sparse set of lights and cameras. Our hybrid representation combines an explicit parametric surface model for the eyeball with implicit deformable volumetric representations for the periocular region and the interior of the eye. This novel hybrid model has been designed to address the various parts of that challenging facial area - the explicit eyeball surface allows modeling refraction and high-frequency specular reflection at the cornea, whereas the implicit representation is well suited to model lower-frequency skin reflection via spherical harmonics and can represent non-surface structures such as hair or diffuse volumetric bodies, both of which are a challenge for explicit surface models. We show that for high-resolution close-ups of the eye, our model can synthesize high-fidelity animated gaze from novel views under unseen illumination conditions. △ Less

Submitted 12 July, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

Comments: 16 pages, 16 figures, 1 table, to be published in ACM Transactions on Graphics (TOG) (Volume: 41, Issue: 4), 2022

ACM Class: I.4.5; I.3

arXiv:2205.13528 [pdf, other]

SFP: State-free Priors for Exploration in Off-Policy Reinforcement Learning

Authors: Marco Bagatella, Sammy Christen, Otmar Hilliges

Abstract: Efficient exploration is a crucial challenge in deep reinforcement learning. Several methods, such as behavioral priors, are able to leverage offline data in order to efficiently accelerate reinforcement learning on complex tasks. However, if the task at hand deviates excessively from the demonstrated task, the effectiveness of such methods is limited. In our work, we propose to learn features fro… ▽ More Efficient exploration is a crucial challenge in deep reinforcement learning. Several methods, such as behavioral priors, are able to leverage offline data in order to efficiently accelerate reinforcement learning on complex tasks. However, if the task at hand deviates excessively from the demonstrated task, the effectiveness of such methods is limited. In our work, we propose to learn features from offline data that are shared by a more diverse range of tasks, such as correlation between actions and directedness. Therefore, we introduce state-free priors, which directly model temporal consistency in demonstrated trajectories, and are capable of driving exploration in complex tasks, even when trained on data collected on simpler tasks. Furthermore, we introduce a novel integration scheme for action priors in off-policy reinforcement learning by dynamically sampling actions from a probabilistic mixture of policy and action prior. We compare our approach against strong baselines and provide empirical evidence that it can accelerate reinforcement learning in long-horizon continuous control tasks under sparse reward settings. △ Less

Submitted 31 August, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

Comments: Accepted by TMLR in August 2022. Project page at https://eth-ait.github.io/sfp/

arXiv:2204.13662 [pdf, other]

ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation

Authors: Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J. Black, Otmar Hilliges

Abstract: Humans intuitively understand that inanimate objects do not move by themselves, but that state changes are typically caused by human manipulation (e.g., the opening of a book). This is not yet the case for machines. In part this is because there exist no datasets with ground-truth 3D annotations for the study of physically consistent and synchronised motion of hands and articulated objects. To thi… ▽ More Humans intuitively understand that inanimate objects do not move by themselves, but that state changes are typically caused by human manipulation (e.g., the opening of a book). This is not yet the case for machines. In part this is because there exist no datasets with ground-truth 3D annotations for the study of physically consistent and synchronised motion of hands and articulated objects. To this end, we introduce ARCTIC -- a dataset of two hands that dexterously manipulate objects, containing 2.1M video frames paired with accurate 3D hand and object meshes and detailed, dynamic contact information. It contains bi-manual articulation of objects such as scissors or laptops, where hand poses and object states evolve jointly in time. We propose two novel articulated hand-object interaction tasks: (1) Consistent motion reconstruction: Given a monocular video, the goal is to reconstruct two hands and articulated objects in 3D, so that their motions are spatio-temporally consistent. (2) Interaction field estimation: Dense relative hand-object distances must be estimated from images. We introduce two baselines ArcticNet and InterField, respectively and evaluate them qualitatively and quantitatively on ARCTIC. Our code and data are available at https://arctic.is.tue.mpg.de. △ Less

Submitted 23 April, 2023; v1 submitted 28 April, 2022; originally announced April 2022.

Comments: Project page: https://arctic.is.tue.mpg.de

Showing 1–50 of 111 results for author: Hilliges, O