-
FaceLift: Semi-supervised 3D Facial Landmark Localization
Authors:
David Ferman,
Pablo Garrido,
Gaurav Bharaj
Abstract:
3D facial landmark localization has proven to be of particular use for applications, such as face tracking, 3D face modeling, and image-based 3D face reconstruction. In the supervised learning case, such methods usually rely on 3D landmark datasets derived from 3DMM-based registration that often lack spatial definition alignment, as compared with that chosen by hand-labeled human consensus, e.g., how are eyebrow landmarks defined? This creates a gap between landmark datasets generated via high-quality 2D human labels and 3DMMs, and it ultimately limits their effectiveness. To address this issue, we introduce a novel semi-supervised learning approach that learns 3D landmarks by directly lifting (visible) hand-labeled 2D landmarks and ensures better definition alignment, without the need for 3D landmark datasets. To lift 2D landmarks to 3D, we leverage 3D-aware GANs for better multi-view consistency learning and in-the-wild multi-frame videos for robust cross-generalization. Empirical experiments demonstrate that our method not only achieves better definition alignment between 2D-3D landmarks but also outperforms other supervised learning 3D landmark localization methods on both 3DMM labeled and photogrammetric ground truth evaluation datasets. Project Page: https://davidcferman.github.io/FaceLift
Submitted 29 May, 2024;
originally announced May 2024.
-
Implicit Neural Head Synthesis via Controllable Local Deformation Fields
Authors:
Chuhan Chen,
Matthew O'Toole,
Gaurav Bharaj,
Pablo Garrido
Abstract:
High-quality reconstruction of controllable 3D head avatars from 2D videos is highly desirable for virtual human applications in movies, games, and telepresence. Neural implicit fields provide a powerful representation to model 3D head avatars with personalized shape, expressions, and facial parts, e.g., hair and mouth interior, that go beyond the linear 3D morphable model (3DMM). However, existing methods do not model faces with fine-scale facial features, or local control of facial parts that extrapolate asymmetric expressions from monocular videos. Further, most condition only on 3DMM parameters with poor(er) locality, and resolve local features with a global neural field. We build on part-based implicit shape models that decompose a global deformation field into local ones. Our novel formulation models multiple implicit deformation fields with local semantic rig-like control via 3DMM-based parameters, and representative facial landmarks. Further, we propose a local control loss and attention mask mechanism that promote sparsity of each learned deformation field. Our formulation renders sharper locally controllable nonlinear deformations than previous implicit monocular approaches, especially mouth interior, asymmetric expressions, and facial details.
Submitted 21 April, 2023;
originally announced April 2023.
-
Few-shot Geometry-Aware Keypoint Localization
Authors:
Xingzhe He,
Gaurav Bharaj,
David Ferman,
Helge Rhodin,
Pablo Garrido
Abstract:
Supervised keypoint localization methods rely on large manually labeled image datasets, where objects can deform, articulate, or occlude. However, creating such large keypoint labels is time-consuming and costly, and is often error-prone due to inconsistent labeling. Thus, we desire an approach that can learn keypoint localization with fewer yet consistently annotated images. To this end, we present a novel formulation that learns to localize semantically consistent keypoint definitions, even for occluded regions, for varying object categories. We use a few user-labeled 2D images as input examples, which are extended via self-supervision using a larger unlabeled dataset. Unlike unsupervised methods, the few-shot images act as semantic shape constraints for object localization. Furthermore, we introduce 3D geometry-aware constraints to uplift keypoints, achieving more accurate 2D localization. Our general-purpose formulation paves the way for semantically conditioned generative modeling and attains competitive or state-of-the-art accuracy on several datasets, including human faces, eyes, animals, cars, and never-before-seen mouth interior (teeth) localization tasks, not attempted by the previous few-shot methods. Project page: https://xingzhehe.github.io/FewShot3DKP/
Submitted 30 March, 2023;
originally announced March 2023.
-
HQ3DAvatar: High Quality Controllable 3D Head Avatar
Authors:
Kartik Teotia,
Mallikarjun B R,
Xingang Pan,
Hyeongwoo Kim,
Pablo Garrido,
Mohamed Elgharib,
Christian Theobalt
Abstract:
Multi-view volumetric rendering techniques have recently shown great potential in modeling and synthesizing high-quality head avatars. A common approach to capture full head dynamic performances is to track the underlying geometry using a mesh-based template or 3D cube-based graphics primitives. While these model-based approaches achieve promising results, they often fail to learn complex geometric details such as the mouth interior, hair, and topological changes over time. This paper presents a novel approach to building highly photorealistic digital head avatars. Our method learns a canonical space via an implicit function parameterized by a neural network. It leverages multiresolution hash encoding in the learned feature space, allowing for high-quality, faster training and high-resolution rendering. At test time, our method is driven by a monocular RGB video. Here, an image encoder extracts face-specific features that also condition the learnable canonical space. This encourages deformation-dependent texture variations during training. We also propose a novel optical flow based loss that ensures correspondences in the learned canonical space, thus encouraging artifact-free and temporally consistent renderings. We show results on challenging facial expressions and show free-viewpoint renderings at interactive real-time rates for medium image resolutions. Our method outperforms all existing approaches, both visually and numerically. We will release our multiple-identity dataset to encourage further research. Our Project page is available at: https://vcai.mpi-inf.mpg.de/projects/HQ3DAvatar/
Submitted 25 March, 2023;
originally announced March 2023.
-
Poisson's CDF applied to Flexible Skylines
Authors:
Jaime Pons Garrido
Abstract:
The evolution of skyline and ranking queries has created new archetypes like flexible skylines, which have proven to be an efficient method to select relevant data from large datasets using multi-objective optimization. This paper aims to study the possible applications of the Poisson distribution mass function as a monotonic scoring function in flexible skyline processes, especially those featuring schemas whose attributes can be translated to constant mean rates. Moreover, a method to express users' requirements by means of the F-dominant set of tuples will be proposed using parametrical variations in F[1]; algorithm construction and potential applications will also be studied.
Submitted 30 January, 2022; v1 submitted 25 January, 2022;
originally announced January 2022.
-
Fog to cloud and network coded based architecture: Minimizing data download time for smart mobility
Authors:
Goiuri Peralta,
Pablo Garrido,
Josu Bilbao,
Ramón Agüero,
Pedro M. Crespo
Abstract:
Industry 4.0 applications foster new business opportunities but they also pose new and challenging requirements, such as low latency communications and highly reliable systems. They enable the exploitation of novel wireless technologies (5G), but it would also be crucial to rely on architectures that appropriately support them. Thus, the combination of fog and cloud computing is emerging as one potential solution. It can dynamically allocate the workload depending on the specific needs of each application. Our main goal is to provide a highly reliable and dynamic architecture, which minimizes the time that an end node or user, for instance a car in a smart mobility application, spends in downloading the required data. In order to achieve this, we have developed an optimal distribution algorithm that decides, based on multiple parameters of the proposed system model, the amount of information that should be stored at, or retrieved from, each node to minimize the data download time. Our scheme exploits Network Coding (NC) as a tool for data distribution and as a key enabler of the proposed solution. We compare the performance of our proposed scheme with other alternative solutions, and the results show that there is a clear gain in terms of the download time.
Submitted 22 January, 2020; v1 submitted 2 December, 2019;
originally announced December 2019.
-
FML: Face Model Learning from Videos
Authors:
Ayush Tewari,
Florian Bernard,
Pablo Garrido,
Gaurav Bharaj,
Mohamed Elgharib,
Hans-Peter Seidel,
Patrick Pérez,
Michael Zollhöfer,
Christian Theobalt
Abstract:
Monocular image-based 3D reconstruction of faces is a long-standing problem in computer vision. Since image data is a 2D projection of a 3D face, the resulting depth ambiguity makes the problem ill-posed. Most existing methods rely on data-driven priors that are built from limited 3D face scans. In contrast, we propose multi-frame video-based self-supervised training of a deep network that (i) learns a face identity model both in shape and appearance while (ii) jointly learning to reconstruct 3D faces. Our face model is learned using only corpora of in-the-wild video clips collected from the Internet. This virtually endless source of training data enables learning of a highly general 3D face model. In order to achieve this, we propose a novel multi-frame consistency loss that ensures consistent shape and appearance across multiple frames of a subject's face, thus minimizing depth ambiguity. At test time we can use an arbitrary number of frames, so that we can perform both monocular as well as multi-frame reconstruction.
Submitted 9 April, 2019; v1 submitted 18 December, 2018;
originally announced December 2018.
-
Deep Video Portraits
Authors:
Hyeongwoo Kim,
Pablo Garrido,
Ayush Tewari,
Weipeng Xu,
Justus Thies,
Matthias Nießner,
Patrick Pérez,
Christian Richardt,
Michael Zollhöfer,
Christian Theobalt
Abstract:
We present a novel approach that enables photo-realistic re-animation of portrait videos using only an input video. In contrast to existing approaches that are restricted to manipulations of facial expressions only, we are the first to transfer the full 3D head position, head rotation, face expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor. The core of our approach is a generative neural network with a novel space-time architecture. The network takes as input synthetic renderings of a parametric face model, based on which it predicts photo-realistic video frames for a given target actor. The realism in this rendering-to-video transfer is achieved by careful adversarial training, and as a result, we can create modified target videos that mimic the behavior of the synthetically-created input. In order to enable source-to-target video re-animation, we render a synthetic target video with the reconstructed head animation parameters from a source video, and feed it into the trained network -- thus taking full control of the target. With the ability to freely recombine source and target parameters, we are able to demonstrate a large variety of video rewrite applications without explicitly modeling hair, body or background. For instance, we can reenact the full head using interactive user-controlled editing, and realize high-fidelity visual dubbing. To demonstrate the high quality of our output, we conduct an extensive series of experiments and evaluations, where for instance a user study shows that our video edits are hard to detect.
Submitted 29 May, 2018;
originally announced May 2018.
-
Joint Scheduling and Coding For Low In-Order Delivery Delay Over Lossy Paths With Delayed Feedback
Authors:
Pablo Garrido,
Douglas Leith,
Ramon Aguero
Abstract:
We consider the transmission of packets across a lossy end-to-end network path so as to achieve low in-order delivery delay. This can be formulated as a decision problem, namely deciding whether the next packet to send should be an information packet or a coded packet. Importantly, this decision is made based on delayed feedback from the receiver. While an exact solution to this decision problem is challenging, we exploit ideas from queueing theory to derive scheduling policies based on prediction of a receiver queue length that, while suboptimal, can be efficiently implemented and offer substantially better performance than state of the art approaches. We obtain a number of useful analytic bounds that help characterise design trade-offs and our analysis highlights that the use of prediction plays a key role in achieving good performance in the presence of significant feedback delay. Our approach readily generalises to networks of paths and we illustrate this by application to multipath transport scheduler design.
Submitted 14 December, 2018; v1 submitted 13 April, 2018;
originally announced April 2018.
-
Self-supervised Multi-level Face Model Learning for Monocular Reconstruction at over 250 Hz
Authors:
Ayush Tewari,
Michael Zollhöfer,
Pablo Garrido,
Florian Bernard,
Hyeongwoo Kim,
Patrick Pérez,
Christian Theobalt
Abstract:
The reconstruction of dense 3D models of face geometry and appearance from a single image is highly challenging and ill-posed. To constrain the problem, many approaches rely on strong priors, such as parametric face models learned from limited 3D scan data. However, prior models restrict generalization of the true diversity in facial geometry, skin reflectance and illumination. To alleviate this problem, we present the first approach that jointly learns 1) a regressor for face shape, expression, reflectance and illumination on the basis of 2) a concurrently learned parametric face model. Our multi-level face model combines the advantage of 3D Morphable Models for regularization with the out-of-space generalization of a learned corrective space. We train end-to-end on in-the-wild images without dense annotations by fusing a convolutional encoder with a differentiable expert-designed renderer and a self-supervised training loss, both defined at multiple detail levels. Our approach compares favorably to the state-of-the-art in terms of reconstruction quality, better generalizes to real world faces, and runs at over 250 Hz.
Submitted 29 March, 2018; v1 submitted 7 December, 2017;
originally announced December 2017.
-
MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction
Authors:
Ayush Tewari,
Michael Zollhöfer,
Hyeongwoo Kim,
Pablo Garrido,
Florian Bernard,
Patrick Pérez,
Christian Theobalt
Abstract:
In this work we propose a novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image. To this end, we combine a convolutional encoder network with an expert-designed generative model that serves as decoder. The core innovation is our new differentiable parametric decoder that encapsulates image formation analytically based on a generative model. Our decoder takes as input a code vector with exactly defined semantic meaning that encodes detailed face pose, shape, expression, skin reflectance and scene illumination. Due to this new way of combining CNN-based with model-based face reconstruction, the CNN-based encoder learns to extract semantically meaningful parameters from a single monocular input image. For the first time, a CNN encoder and an expert-designed generative model can be trained end-to-end in an unsupervised manner, which renders training on very large (unlabeled) real world data feasible. The obtained reconstructions compare favorably to current state-of-the-art approaches in terms of quality and richness of representation.
Submitted 7 December, 2017; v1 submitted 30 March, 2017;
originally announced March 2017.
-
A Markov Chain Model for the Decoding Probability of Sparse Network Coding
Authors:
Garrido Pablo,
Lucani E. Daniel,
Aguero Ramon
Abstract:
Random Linear Network Coding (RLNC) has been proved to offer an efficient communication scheme, leveraging an interesting robustness against packet losses. However, it suffers from a high computational complexity and some novel approaches, which follow the same idea, have been recently proposed. One such solution is Tunable Sparse Network Coding (TSNC), where only a few packets are combined in each transmission. The number of data packets to be combined in each transmission can be set from a density parameter/distribution, which could be eventually adapted. In this work we present an analytical model that captures the performance of SNC in an accurate way. We exploit an absorbing Markov process where the states are defined by the number of useful packets received by the decoder, i.e., the decoding matrix rank, and the number of non-zero columns in that matrix. The model is validated by means of a thorough simulation campaign, and the difference between model and simulation is negligible, with a mean square error of less than $4 \cdot 10^{-4}$ in the worst cases. We also include in the comparison some of the more general bounds that have been recently used, showing that their accuracy is rather poor. The proposed model would enable a more precise assessment of the behavior of sparse network coding techniques. The last results show that the proposed analytical model can be exploited by TSNC techniques so that the encoder selects the best density as the transmission evolves.
Submitted 22 July, 2016; v1 submitted 18 July, 2016;
originally announced July 2016.
-
CONDITOR1: Topic Maps and DITA labelling tool for textual documents with historical information
Authors:
Piedad Garrido,
Jesus Tramullas,
Manuel Coll
Abstract:
Conditor is a software tool which works with textual documents containing historical information. The purpose of this work is two-fold: firstly, to show the validity of the developed engine to correctly identify and label the entities of the universe of discourse with a labelled-combined XTM-DITA model; secondly, to explain the improvements achieved in the information retrieval process thanks to the use of an object-oriented database (JPOX) as well as its integration into the Lucene-type database search process, not only to accomplish more accurate searches but also to help the future development of a recommender system. We finish with a brief demo in a 3D-graph of the results of the aforementioned search.
Submitted 23 March, 2016;
originally announced March 2016.
-
Automatic Face Reenactment
Authors:
Pablo Garrido,
Levi Valgaerts,
Ole Rehmsen,
Thorsten Thormaehlen,
Patrick Perez,
Christian Theobalt
Abstract:
We propose an image-based, facial reenactment system that replaces the face of an actor in an existing target video with the face of a user from a source video, while preserving the original target performance. Our system is fully automatic and does not require a database of source expressions. Instead, it is able to produce convincing reenactment results from a short source video captured with an off-the-shelf camera, such as a webcam, where the user performs arbitrary facial gestures. Our reenactment pipeline is conceived as part image retrieval and part face transfer: The image retrieval is based on temporal clustering of target frames and a novel image matching metric that combines appearance and motion to select candidate frames from the source video, while the face transfer uses a 2D warping strategy that preserves the user's identity. Our system excels in simplicity as it does not rely on a 3D face model, it is robust under head motion and does not require the source and target performance to be similar. We show convincing reenactment results for videos that we recorded ourselves and for low-quality footage taken from the Internet.
Submitted 8 February, 2016;
originally announced February 2016.