Search | arXiv e-print repository

3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing

Authors: Balamurugan Thambiraja, Sadegh Aliakbarian, Darren Cosker, Justus Thies

Abstract: We present 3DiFACE, a novel method for personalized speech-driven 3D facial animation and editing. While existing methods deterministically predict facial animations from speech, they overlook the inherent one-to-many relationship between speech and facial expressions, i.e., there are multiple reasonable facial expression animations matching an audio input. It is especially important in content creation to be able to modify generated motion or to specify keyframes. To enable stochasticity as well as motion editing, we propose a lightweight audio-conditioned diffusion model for 3D facial motion. This diffusion model can be trained on a small 3D motion dataset, maintaining expressive lip motion output. In addition, it can be finetuned for specific subjects, requiring only a short video of the person. Through quantitative and qualitative evaluations, we show that our method outperforms existing state-of-the-art techniques and yields speech-driven animations with greater fidelity and diversity.

Submitted 1 December, 2023; originally announced December 2023.

Comments: Project page: https://balamuruganthambiraja.github.io/3DiFACE/

arXiv:2308.11261 [pdf, other]

HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations

Authors: Sadegh Aliakbarian, Fatemeh Saleh, David Collier, Pashmina Cameron, Darren Cosker

Abstract: Generating both plausible and accurate full body avatar motion is the key to the quality of immersive experiences in mixed reality scenarios. Head-Mounted Devices (HMDs) typically only provide a few input signals, such as head and hands 6-DoF. Recently, different approaches achieved impressive performance in generating full body motion given only head and hands signal. However, to the best of our knowledge, all existing approaches rely on full hand visibility. While this is the case when, e.g., using motion controllers, a considerable proportion of mixed reality experiences do not involve motion controllers and instead rely on egocentric hand tracking. This introduces the challenge of partial hand visibility owing to the restricted field of view of the HMD. In this paper, we propose the first unified approach, HMD-NeMo, that addresses plausible and accurate full body motion generation even when the hands may be only partially visible. HMD-NeMo is a lightweight neural network that predicts the full body motion in an online and real-time fashion. At the heart of HMD-NeMo is the spatio-temporal encoder with novel temporally adaptable mask tokens that encourage plausible motion in the absence of hand observations. We perform extensive analysis of the impact of different components in HMD-NeMo and introduce a new state-of-the-art on AMASS dataset through our evaluation.

Submitted 22 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV 2023

arXiv:2304.06024 [pdf, other]

Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views

Authors: Siwei Zhang, Qianli Ma, Yan Zhang, Sadegh Aliakbarian, Darren Cosker, Siyu Tang

Abstract: Automatic perception of human behaviors during social interactions is crucial for AR/VR applications, and an essential component is estimation of plausible 3D human pose and shape of our social partners from the egocentric view. One of the biggest challenges of this task is severe body truncation due to close social distances in egocentric scenarios, which brings large pose ambiguities for unseen body parts. To tackle this challenge, we propose a novel scene-conditioned diffusion method to model the body pose distribution. Conditioned on the 3D scene geometry, the diffusion model generates bodies in plausible human-scene interactions, with the sampling guided by a physics-based collision score to further resolve human-scene inter-penetrations. The classifier-free training enables flexible sampling with different conditions and enhanced diversity. A visibility-aware graph convolution model guided by per-joint visibility serves as the diffusion denoiser to incorporate inter-joint dependencies and per-body-part control. Extensive evaluations show that our method generates bodies in plausible interactions with 3D scenes, achieving both superior accuracy for visible joints and diversity for invisible body parts. The code is available at https://sanweiliti.github.io/egohmr/egohmr.html.

Submitted 16 September, 2023; v1 submitted 12 April, 2023; originally announced April 2023.

Comments: Camera ready version for ICCV 2023, appendix included

arXiv:2301.00023 [pdf, other]

Imitator: Personalized Speech-driven 3D Facial Animation

Authors: Balamurugan Thambiraja, Ikhsanul Habibie, Sadegh Aliakbarian, Darren Cosker, Christian Theobalt, Justus Thies

Abstract: Speech-driven 3D facial animation has been widely explored, with applications in gaming, character animation, virtual reality, and telepresence systems. State-of-the-art methods deform the face topology of the target actor to sync the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor, thus, resulting in unrealistic and inaccurate lip movements. To address this, we present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video and produces novel facial expressions matching the identity-specific speaking style and facial idiosyncrasies of the target actor. Specifically, we train a style-agnostic transformer on a large facial expression dataset which we use as a prior for audio-driven facial expressions. Based on this prior, we optimize for identity-specific speaking style based on a short reference video. To train the prior, we introduce a novel loss function based on detected bilabial consonants to ensure plausible lip closures and consequently improve the realism of the generated expressions. Through detailed experiments and a user study, we show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.

Submitted 30 December, 2022; originally announced January 2023.

Comments: https://youtu.be/JhXTdjiUCUw

arXiv:2107.07330 [pdf, other]

DynaDog+T: A Parametric Animal Model for Synthetic Canine Image Generation

Authors: Jake Deane, Sinead Kearney, Kwang In Kim, Darren Cosker

Abstract: Synthetic data is becoming increasingly common for training computer vision models for a variety of tasks. Notably, such data has been applied in tasks related to humans such as 3D pose estimation where data is either difficult to create or obtain in realistic settings. Comparatively, there has been less work into synthetic animal data and its uses for training models. Consequently, we introduce a parametric canine model, DynaDog+T, for generating synthetic canine images and data which we use for a common computer vision task, binary segmentation, which would otherwise be difficult due to the lack of available data.

Submitted 20 July, 2021; v1 submitted 15 July, 2021; originally announced July 2021.

Comments: CV4Animals Workshop in CVPR 2021. Update to correct minor spelling and grammar mistakes in supplementary material

arXiv:2107.00480 [pdf, other]

EmoGen: Quantifiable Emotion Generation and Analysis for Experimental Psychology

Authors: Nadejda Roubtsova, Martin Parsons, Nicola Binetti, Isabelle Mareschal, Essi Viding, Darren Cosker

Abstract: 3D facial modelling and animation in computer vision and graphics traditionally require either a digital artist's skill or complex pipelines with objective-function-based solvers to fit models to motion capture. This inaccessibility of quality modelling to a non-expert is an impediment to effective quantitative study of facial stimuli in experimental psychology. The EmoGen methodology we present in this paper solves the issue by democratising facial modelling technology. EmoGen is a robust and configurable framework letting anyone author arbitrary quantifiable facial expressions in 3D through a user-guided genetic algorithm search. Beyond sample generation, our methodology is made complete with techniques to analyse distributions of these expressions in a principled way. This paper covers the technical aspects of expression generation, specifically our production-quality facial blendshape model, automatic corrective mechanisms of implausible facial configurations in the absence of an artist's supervision and the genetic algorithm implementation employed in the model space search. Further, we provide a comparative evaluation of ways to quantify generated facial expressions in the blendshape and geometric domains and compare them theoretically and empirically. The purpose of this analysis is 1. to define a similarity cost function to simulate model space search for convergence and parameter dependence assessment of the genetic algorithm and 2. to inform the best practices in the data distribution analysis for experimental psychology.

Submitted 1 July, 2021; originally announced July 2021.

arXiv:2004.07788 [pdf, other]

RGBD-Dog: Predicting Canine Pose from RGBD Sensors

Authors: Sinead Kearney, Wenbin Li, Martin Parsons, Kwang In Kim, Darren Cosker

Abstract: The automatic extraction of animal 3D pose from images without markers is of interest in a range of scientific fields. Most work to date predicts animal pose from RGB images, based on 2D labelling of joint positions. However, due to the difficult nature of obtaining training data, no ground truth dataset of 3D animal motion is available to quantitatively evaluate these approaches. In addition, a lack of 3D animal pose data also makes it difficult to train 3D pose-prediction methods in a similar manner to the popular field of body-pose prediction. In our work, we focus on the problem of 3D canine pose estimation from RGBD images, recording a diverse range of dog breeds with several Microsoft Kinect v2s, simultaneously obtaining the 3D ground truth skeleton via a motion capture system. We generate a dataset of synthetic RGBD images from this data. A stacked hourglass network is trained to predict 3D joint locations, which is then constrained using prior models of shape and pose. We evaluate our model on both synthetic and real RGBD images and compare our results to previously published work fitting canine models to images. Finally, despite our training set consisting only of dog data, visual inspection implies that our network can produce good predictions for images of other quadrupeds -- e.g. horses or cats -- when their pose is similar to that contained in our training set.

Submitted 16 April, 2020; originally announced April 2020.

Comments: 18 pages, 16 figures, to be published in CVPR 2020

arXiv:1806.02311 [pdf, other]

Unsupervised Attention-guided Image to Image Translation

Authors: Youssef A. Mejjati, Christian Richardt, James Tompkin, Darren Cosker, Kwang In Kim

Abstract: Current unsupervised image-to-image translation techniques struggle to focus their attention on individual objects without altering the background or the way multiple objects interact within a scene. Motivated by the important role of attention in human perception, we tackle this limitation by introducing unsupervised attention mechanisms that are jointly adversarially trained with the generators and discriminators. We demonstrate qualitatively and quantitatively that our approach is able to attend to relevant regions in the image without requiring supervision, and that by doing so it achieves more realistic mappings compared to recent approaches.

Submitted 8 November, 2018; v1 submitted 6 June, 2018; originally announced June 2018.

Journal ref: NIPS 2018

arXiv:1710.01802 [pdf, ps, other]

doi 10.1371/journal.pone.0187513

Automatic Structural Scene Digitalization

Authors: Rui Tang, Yuhan Wang, Darren Cosker, Wenbin Li

Abstract: In this paper, we present an automatic system for the analysis and labeling of structural scenes, floor plan drawings in Computer-aided Design (CAD) format. The proposed system applies a fusion strategy to detect and recognize various components of CAD floor plans, such as walls, doors, windows and other ambiguous assets. Technically, a general rule-based filter parsing method is first adopted to extract effective information from the original floor plan. Then, an image-processing based recovery method is employed to correct information extracted in the first step. Our proposed method is fully automatic and real-time. This analysis system provides high accuracy and is also evaluated on a public website that, on average, achieves more than ten thousand effective uses per day and reaches a relatively high satisfaction rate.

Submitted 19 September, 2017; originally announced October 2017.

Comments: paper submitted to PloS One

arXiv:1704.05817 [pdf, other]

Learn to Model Motion from Blurry Footages

Authors: Wenbin Li, Da Chen, Zhihan Lv, Yan Yan, Darren Cosker

Abstract: It is difficult to recover the motion field from real-world footage given a mixture of camera shake and other photometric effects. In this paper we propose a hybrid framework by interleaving a Convolutional Neural Network (CNN) and a traditional optical flow energy. We first construct a CNN architecture using a novel learnable directional filtering layer. This layer encodes the angle and distance similarity matrix between blur and camera motion, which is able to enhance the blur features of the camera-shake footage. The proposed CNNs are then integrated into an iterative optical flow framework, which enables modelling and solving both the blind deconvolution and the optical flow estimation problems simultaneously. Our framework is trained end-to-end on a synthetic dataset and yields competitive precision and performance against the state-of-the-art approaches.

Submitted 19 April, 2017; originally announced April 2017.

Comments: Preprint of our paper accepted by Pattern Recognition

arXiv:1608.00762 [pdf, other]

doi 10.1364/JOSAA.33.001798

Interactive Removal and Ground Truth for Difficult Shadow Scenes

Authors: Han Gong, Darren P. Cosker

Abstract: A user-centric method for fast, interactive, robust and high-quality shadow removal is presented. Our algorithm can perform detection and removal in a range of difficult cases, such as highly textured and colored shadows. To perform detection, an on-the-fly learning approach is adopted, guided by two rough user inputs for the pixels of the shadow and the lit area. After detection, shadow removal is performed by registering the penumbra to a normalized frame, which allows efficient estimation of non-uniform shadow illumination changes, resulting in accurate and robust removal. Another major contribution of this work is the first validated and multi-scene category ground truth for shadow removal algorithms. This data set containing 186 images eliminates inconsistencies between shadow and shadow-free images and provides a range of different shadow types such as soft, textured, colored and broken shadow. Using this data, the most thorough comparison of state-of-the-art shadow removal methods to date is performed, showing our proposed new algorithm to outperform the state-of-the-art across several measures and shadow categories. To complement our dataset, an online shadow removal benchmark website is also presented to encourage future open comparisons in this challenging field of research.

Submitted 2 August, 2016; originally announced August 2016.

Comments: Accepted by JOSA A

arXiv:1603.08124 [pdf, other]

Video Interpolation using Optical Flow and Laplacian Smoothness

Authors: Wenbin Li, Darren Cosker

Abstract: Non-rigid video interpolation is a common computer vision task. In this paper we present an optical flow approach which adopts a Laplacian Cotangent Mesh constraint to enhance the local smoothness. Similar to Li et al., our approach adopts a mesh to the image with a resolution up to one vertex per pixel and uses angle constraints to ensure sensible local deformations between image pairs. The Laplacian Mesh constraints are expressed wholly inside the optical flow optimization, and can be applied in a straightforward manner to a wide range of image tracking and registration problems. We evaluate our approach by testing on several benchmark datasets, including the Middlebury and Garg et al. datasets. In addition, we show application of our method for constructing 3D Morphable Facial Models from dynamic 3D data.

Submitted 26 March, 2016; originally announced March 2016.

arXiv:1603.08120 [pdf, other]

doi 10.1109/LRA.2016.2592513

Nonrigid Optical Flow Ground Truth for Real-World Scenes with Time-Varying Shading Effects

Authors: Wenbin Li, Darren Cosker, Zhihan Lv, Matthew Brown

Abstract: In this paper we present a dense ground truth dataset of nonrigidly deforming real-world scenes. Our dataset contains both long and short video sequences, and enables the quantitative evaluation of RGB-based tracking and registration methods. To construct ground truth for the RGB sequences, we simultaneously capture Near-Infrared (NIR) image sequences where dense markers - visible only in NIR - represent ground truth positions. This allows for comparison with automatically tracked RGB positions and the formation of error metrics. Most previous datasets containing nonrigidly deforming sequences are based on synthetic data. Our capture protocol enables us to acquire real-world deforming objects with realistic photometric effects - such as blur and illumination change - as well as occlusion and complex deformations. A public evaluation website is constructed to allow for ranking of RGB image based optical flow and other dense tracking algorithms, with various statistical measures. Furthermore, we present an RGB-NIR multispectral optical flow model allowing for energy optimization by adaptively combining featured information from both the RGB and the complementary NIR channels. In our experiments we evaluate eight existing RGB-based optical flow methods on our new dataset. We also evaluate our hybrid optical flow algorithm by comparing to two existing multispectral approaches, as well as varying our input channels across RGB, NIR and RGB-NIR.

Submitted 15 July, 2016; v1 submitted 26 March, 2016; originally announced March 2016.

Comments: preprint of our paper accepted by RA-L'16

arXiv:1603.02253 [pdf, other]

Blur Robust Optical Flow using Motion Channel

Authors: Wenbin Li, Yang Chen, JeeHang Lee, Gang Ren, Darren Cosker

Abstract: It is hard to estimate optical flow given a real-world video sequence with camera shake and other motion blur. In this paper, we first investigate the blur parameterization for video footage using near linear motion elements. We then combine a commercial 3D pose sensor with an RGB camera, in order to film video footage of interest together with the camera motion. We illustrate that this additional camera motion/trajectory channel can be embedded into a hybrid framework by interleaving an iterative blind deconvolution and warping-based optical flow scheme. Our method yields improved accuracy over three other state-of-the-art baselines given our proposed ground truth blurry sequences and several other real-world sequences filmed by our imaging system.

Submitted 7 March, 2016; originally announced March 2016.

Comments: Preprint of our paper accepted by Neurocomputing

arXiv:1603.02252 [pdf, other]

Drift Robust Non-rigid Optical Flow Enhancement for Long Sequences

Authors: Wenbin Li, Darren Cosker, Matthew Brown

Abstract: It is hard to densely track a nonrigid object over the long term, which is a fundamental research issue in the computer vision community. This task often relies on estimating pairwise correspondences between images over time, where the error accumulates and leads to a drift issue. In this paper, we introduce a novel optimization framework with an Anchor Patch constraint. It is intended to significantly reduce overall errors given long sequences containing non-rigidly deformable objects. Our framework can be applied to any dense tracking algorithm, e.g. optical flow. We demonstrate the success of our approach by showing significant error reduction on 6 popular optical flow algorithms applied to a range of real-world nonrigid benchmarks. We also provide quantitative analysis of our approach given synthetic occlusions and image noise.

Submitted 7 March, 2016; originally announced March 2016.

Comments: Preprint version of our paper accepted by Journal of Intelligent and Fuzzy Systems

Showing 1–15 of 15 results for author: Cosker, D