Search | arXiv e-print repository

3D Facial Expressions through Analysis-by-Neural-Synthesis

Authors: George Retsinas, Panagiotis P. Filntisis, Radek Danecek, Victoria F. Abrevaya, Anastasios Roussos, Timo Bolkart, Petros Maragos

Abstract: While existing methods for 3D face reconstruction from in-the-wild images excel at recovering the overall face shape, they commonly miss subtle, extreme, asymmetric, or rarely observed expressions. We improve upon these methods with SMIRK (Spatial Modeling for Image-based Reconstruction of Kinesics), which faithfully reconstructs expressive 3D faces from images. We identify two key limitations in… ▽ More While existing methods for 3D face reconstruction from in-the-wild images excel at recovering the overall face shape, they commonly miss subtle, extreme, asymmetric, or rarely observed expressions. We improve upon these methods with SMIRK (Spatial Modeling for Image-based Reconstruction of Kinesics), which faithfully reconstructs expressive 3D faces from images. We identify two key limitations in existing methods: shortcomings in their self-supervised training formulation, and a lack of expression diversity in the training images. For training, most methods employ differentiable rendering to compare a predicted face mesh with the input image, along with a plethora of additional loss functions. This differentiable rendering loss not only has to provide supervision to optimize for 3D face geometry, camera, albedo, and lighting, which is an ill-posed optimization problem, but the domain gap between rendering and input image further hinders the learning process. Instead, SMIRK replaces the differentiable rendering with a neural rendering module that, given the rendered predicted mesh geometry, and sparsely sampled pixels of the input image, generates a face image. As the neural rendering gets color information from sampled image pixels, supervising with neural rendering-based reconstruction loss can focus solely on the geometry. Further, it enables us to generate images of the input identity with varying expressions while training. These are then utilized as input to the reconstruction model and used as supervision with ground truth geometry. This effectively augments the training data and enhances the generalization for diverse expressions. Our qualitative, quantitative and particularly our perceptual evaluations demonstrate that SMIRK achieves the new state-of-the art performance on accurate expression reconstruction. Project webpage: https://georgeretsi.github.io/smirk/. △ Less

Submitted 5 April, 2024; originally announced April 2024.

arXiv:2312.06613 [pdf, other]

Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism

Authors: Georgios Milis, Panagiotis P. Filntisis, Anastasios Roussos, Petros Maragos

Abstract: Recent advances in deep learning for sequential data have given rise to fast and powerful models that produce realistic videos of talking humans. The state of the art in talking face generation focuses mainly on lip-syncing, being conditioned on audio clips. However, having the ability to synthesize talking humans from text transcriptions rather than audio is particularly beneficial for many appli… ▽ More Recent advances in deep learning for sequential data have given rise to fast and powerful models that produce realistic videos of talking humans. The state of the art in talking face generation focuses mainly on lip-syncing, being conditioned on audio clips. However, having the ability to synthesize talking humans from text transcriptions rather than audio is particularly beneficial for many applications and is expected to receive more and more attention, following the recent breakthroughs in large language models. For that, most methods implement a cascaded 2-stage architecture of a text-to-speech module followed by an audio-driven talking face generator, but this ignores the highly complex interplay between audio and visual streams that occurs during speaking. In this paper, we propose the first, to the best of our knowledge, text-driven audiovisual speech synthesizer that uses Transformers and does not follow a cascaded approach. Our method, which we call NEUral Text to ARticulate Talk (NEUTART), is a talking face generator that uses a joint audiovisual feature space, as well as speech-informed 3D facial reconstructions and a lip-reading loss for visual supervision. The proposed model produces photorealistic talking face videos with human-like articulation and well-synced audiovisual streams. Our experiments on audiovisual datasets as well as in-the-wild videos reveal state-of-the-art generation quality both in terms of objective metrics and human evaluation. △ Less

Submitted 11 December, 2023; originally announced December 2023.

arXiv:2209.13971 [pdf, other]

3D Neural Sculpting (3DNS): Editing Neural Signed Distance Functions

Authors: Petros Tzathas, Petros Maragos, Anastasios Roussos

Abstract: In recent years, implicit surface representations through neural networks that encode the signed distance have gained popularity and have achieved state-of-the-art results in various tasks (e.g. shape representation, shape reconstruction, and learning shape priors). However, in contrast to conventional shape representations such as polygon meshes, the implicit representations cannot be easily edit… ▽ More In recent years, implicit surface representations through neural networks that encode the signed distance have gained popularity and have achieved state-of-the-art results in various tasks (e.g. shape representation, shape reconstruction, and learning shape priors). However, in contrast to conventional shape representations such as polygon meshes, the implicit representations cannot be easily edited and existing works that attempt to address this problem are extremely limited. In this work, we propose the first method for efficient interactive editing of signed distance functions expressed through neural networks, allowing free-form editing. Inspired by 3D sculpting software for meshes, we use a brush-based framework that is intuitive and can in the future be used by sculptors and digital artists. In order to localize the desired surface deformations, we regulate the network by using a copy of it to sample the previously expressed surface. We introduce a novel framework for simulating sculpting-style surface edits, in conjunction with interactive surface sampling and efficient adaptation of network weights. We qualitatively and quantitatively evaluate our method in various different 3D objects and under many different edits. The reported results clearly show that our method yields high accuracy, in terms of achieving the desired edits, while at the same time preserving the geometry outside the interaction areas. △ Less

Submitted 27 January, 2023; v1 submitted 28 September, 2022; originally announced September 2022.

Comments: 14 pages, 10 figures, 3 tables

arXiv:2209.01470 [pdf, other]

Neural Sign Reenactor: Deep Photorealistic Sign Language Retargeting

Authors: Christina O. Tze, Panagiotis P. Filntisis, Athanasia-Lida Dimou, Anastasios Roussos, Petros Maragos

Abstract: In this paper, we introduce a neural rendering pipeline for transferring the facial expressions, head pose, and body movements of one person in a source video to another in a target video. We apply our method to the challenging case of Sign Language videos: given a source video of a sign language user, we can faithfully transfer the performed manual (e.g., handshape, palm orientation, movement, lo… ▽ More In this paper, we introduce a neural rendering pipeline for transferring the facial expressions, head pose, and body movements of one person in a source video to another in a target video. We apply our method to the challenging case of Sign Language videos: given a source video of a sign language user, we can faithfully transfer the performed manual (e.g., handshape, palm orientation, movement, location) and non-manual (e.g., eye gaze, facial expressions, mouth patterns, head, and body movements) signs to a target video in a photo-realistic manner. Our method can be used for Sign Language Anonymization, Sign Language Production (synthesis module), as well as for reenacting other types of full body activities (dancing, acting performance, exercising, etc.). We conduct detailed qualitative and quantitative evaluations and comparisons, which demonstrate the particularly promising and realistic results that we obtain and the advantages of our method over existing approaches. △ Less

Submitted 30 May, 2023; v1 submitted 3 September, 2022; originally announced September 2022.

Comments: Accepted at AI4CC Workshop at CVPR 2023

arXiv:2207.11094 [pdf, other]

Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos

Authors: Panagiotis P. Filntisis, George Retsinas, Foivos Paraperas-Papantoniou, Athanasios Katsamanis, Anastasios Roussos, Petros Maragos

Abstract: The recent state of the art on monocular 3D face reconstruction from image data has made some impressive advancements, thanks to the advent of Deep Learning. However, it has mostly focused on input coming from a single RGB image, overlooking the following important factors: a) Nowadays, the vast majority of facial image data of interest do not originate from single images but rather from videos, w… ▽ More The recent state of the art on monocular 3D face reconstruction from image data has made some impressive advancements, thanks to the advent of Deep Learning. However, it has mostly focused on input coming from a single RGB image, overlooking the following important factors: a) Nowadays, the vast majority of facial image data of interest do not originate from single images but rather from videos, which contain rich dynamic information. b) Furthermore, these videos typically capture individuals in some form of verbal communication (public talks, teleconferences, audiovisual human-computer interactions, interviews, monologues/dialogues in movies, etc). When existing 3D face reconstruction methods are applied in such videos, the artifacts in the reconstruction of the shape and motion of the mouth area are often severe, since they do not match well with the speech audio. To overcome the aforementioned limitations, we present the first method for visual speech-aware perceptual reconstruction of 3D mouth expressions. We do this by proposing a "lipread" loss, which guides the fitting process so that the elicited perception from the 3D reconstructed talking head resembles that of the original video footage. We demonstrate that, interestingly, the lipread loss is better suited for 3D reconstruction of mouth movements compared to traditional landmark losses, and even direct 3D supervision. Furthermore, the devised method does not rely on any text transcriptions or corresponding audio, rendering it ideal for training in unlabeled datasets. We verify the efficiency of our method through exhaustive objective evaluations on three large-scale datasets, as well as subjective evaluation with two web-based user studies. △ Less

Submitted 22 July, 2022; originally announced July 2022.

arXiv:2112.00585 [pdf, other]

Neural Emotion Director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos

Authors: Foivos Paraperas Papantoniou, Panagiotis P. Filntisis, Petros Maragos, Anastasios Roussos

Abstract: In this paper, we introduce a novel deep learning method for photo-realistic manipulation of the emotional state of actors in "in-the-wild" videos. The proposed method is based on a parametric 3D face representation of the actor in the input scene that offers a reliable disentanglement of the facial identity from the head pose and facial expressions. It then uses a novel deep domain translation fr… ▽ More In this paper, we introduce a novel deep learning method for photo-realistic manipulation of the emotional state of actors in "in-the-wild" videos. The proposed method is based on a parametric 3D face representation of the actor in the input scene that offers a reliable disentanglement of the facial identity from the head pose and facial expressions. It then uses a novel deep domain translation framework that alters the facial expressions in a consistent and plausible manner, taking into account their dynamics. Finally, the altered facial expressions are used to photo-realistically manipulate the facial region in the input scene based on an especially-designed neural face renderer. To the best of our knowledge, our method is the first to be capable of controlling the actor's facial expressions by even using as a sole input the semantic labels of the manipulated emotions, while at the same time preserving the speech-related lip movements. We conduct extensive qualitative and quantitative evaluations and comparisons, which demonstrate the effectiveness of our approach and the especially promising results that we obtain. Our method opens a plethora of new possibilities for useful applications of neural rendering technologies, ranging from movie post-production and video games to photo-realistic affective avatars. △ Less

Submitted 30 March, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

Comments: CVPR 2022 (oral). Project page: https://foivospar.github.io/NED/

arXiv:2111.07902 [pdf, other]

Deep Semantic Manipulation of Facial Videos

Authors: Girish Kumar Solanki, Anastasios Roussos

Abstract: Editing and manipulating facial features in videos is an interesting and important field of research with a plethora of applications, ranging from movie post-production and visual effects to realistic avatars for video games and virtual assistants. Our method supports semantic video manipulation based on neural rendering and 3D-based facial expression modelling. We focus on interactive manipulatio… ▽ More Editing and manipulating facial features in videos is an interesting and important field of research with a plethora of applications, ranging from movie post-production and visual effects to realistic avatars for video games and virtual assistants. Our method supports semantic video manipulation based on neural rendering and 3D-based facial expression modelling. We focus on interactive manipulation of the videos by altering and controlling the facial expressions, achieving promising photorealistic results. The proposed method is based on a disentangled representation and estimation of the 3D facial shape and activity, providing the user with intuitive and easy-to-use control of the facial expressions in the input video. We also introduce a user-friendly, interactive AI tool that processes human-readable semantic labels about the desired expression manipulations in specific parts of the input video and synthesizes photorealistic manipulated videos. We achieve that by map** the emotion labels to points on the Valence-Arousal space (where Valence quantifies how positive or negative is an emotion and Arousal quantifies the power of the emotion activation), which in turn are mapped to disentangled 3D facial expressions through an especially-designed and trained expression decoder network. The paper presents detailed qualitative and quantitative experiments, which demonstrate the effectiveness of our system and the promising results it achieves. △ Less

Submitted 17 October, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

Comments: 4th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW), European Conference on Computer Vision (ECCV), Tel Aviv, Israel, October 2022

arXiv:2008.03913 [pdf, other]

NFCGate: Opening the Door for NFC Security Research with a Smartphone-Based Toolkit

Authors: Steffen Klee, Alexandros Roussos, Max Maass, Matthias Hollick

Abstract: Near-Field Communication (NFC) is being used in a variety of security-critical applications, from access control to payment systems. However, NFC protocol analysis typically requires expensive or conspicuous dedicated hardware, or is severely limited on smartphones. In 2015, the NFCGate proof of concept aimed at solving this issue by providing capabilities for NFC analysis employing off-the-shelf… ▽ More Near-Field Communication (NFC) is being used in a variety of security-critical applications, from access control to payment systems. However, NFC protocol analysis typically requires expensive or conspicuous dedicated hardware, or is severely limited on smartphones. In 2015, the NFCGate proof of concept aimed at solving this issue by providing capabilities for NFC analysis employing off-the-shelf Android smartphones. In this paper, we present an extended and improved NFC toolkit based on the functionally limited original open-source codebase. With in-flight traffic analysis and modification, relay, and replay features this toolkit turns an off-the-shelf smartphone into a powerful NFC research tool. To support the development of countermeasures against relay attacks, we investigate the latency incurred by NFCGate in different configurations. Our newly implemented features and improvements enable the case study of an award-winning, enterprise-level NFC lock from a well-known European lock vendor, which would otherwise require dedicated hardware. The analysis of the lock reveals several security issues, which were disclosed to the vendor. △ Less

Submitted 10 August, 2020; originally announced August 2020.

Comments: Accepted to Usenix WOOT'20. Source Code and binaries available at https://github.com/nfcgate/nfcgate

arXiv:2006.10500 [pdf, ps, other]

ReenactNet: Real-time Full Head Reenactment

Authors: Mohammad Rami Koujan, Michail Christos Doukas, Anastasios Roussos, Stefanos Zafeiriou

Abstract: Video-to-video synthesis is a challenging problem aiming at learning a translation function between a sequence of semantic maps and a photo-realistic video depicting the characteristics of a driving video. We propose a head-to-head system of our own implementation capable of fully transferring the human head 3D pose, facial expressions and eye gaze from a source to a target actor, while preserving… ▽ More Video-to-video synthesis is a challenging problem aiming at learning a translation function between a sequence of semantic maps and a photo-realistic video depicting the characteristics of a driving video. We propose a head-to-head system of our own implementation capable of fully transferring the human head 3D pose, facial expressions and eye gaze from a source to a target actor, while preserving the identity of the target actor. Our system produces high-fidelity, temporally-smooth and photo-realistic synthetic videos faithfully transferring the human time-varying head attributes from the source to the target actor. Our proposed implementation: 1) works in real time ($\sim 20$ fps), 2) runs on a commodity laptop with a webcam as the only input, 3) is interactive, allowing the participant to drive a target person, e.g. a celebrity, politician, etc, instantly by varying their expressions, head pose, and eye gaze, and visualising the synthesised video concurrently. △ Less

Submitted 21 May, 2020; originally announced June 2020.

Comments: to be published in 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)

arXiv:2006.10499 [pdf, other]

Real-Time Monocular 4D Face Reconstruction using the LSFM models

Authors: Mohammad Rami Koujan, Nikolai Dochev, Anastasios Roussos

Abstract: 4D face reconstruction from a single camera is a challenging task, especially when it is required to be performed in real time. We demonstrate a system of our own implementation that solves this task accurately and runs in real time on a commodity laptop, using a webcam as the only input. Our system is interactive, allowing the user to freely move their head and show various expressions while stan… ▽ More 4D face reconstruction from a single camera is a challenging task, especially when it is required to be performed in real time. We demonstrate a system of our own implementation that solves this task accurately and runs in real time on a commodity laptop, using a webcam as the only input. Our system is interactive, allowing the user to freely move their head and show various expressions while standing in front of the camera. As a result, the put forward system both reconstructs and visualises the identity of the subject in the correct pose along with the acted facial expressions in real-time. The 4D reconstruction in our framework is based on the recently-released Large-Scale Facial Models (LSFM) \cite{LSFM1, LSFM2}, which are the largest-scale 3D Morphable Models of facial shapes ever constructed, based on a dataset of more than 10,000 facial identities from a wide range of gender, age and ethnicity combinations. This is the first real-time demo that gives users the opportunity to test in practice the capabilities of the recently-released Large-Scale Facial Models (LSFM) △ Less

Submitted 21 May, 2020; originally announced June 2020.

Comments: Published in Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production

arXiv:2006.10199 [pdf, other]

doi 10.1109/TBIOM.2021.3049576

Head2Head++: Deep Facial Attributes Re-Targeting

Authors: Michail Christos Doukas, Mohammad Rami Koujan, Viktoriia Sharmanska, Anastasios Roussos

Abstract: Facial video re-targeting is a challenging problem aiming to modify the facial attributes of a target subject in a seamless manner by a driving monocular sequence. We leverage the 3D geometry of faces and Generative Adversarial Networks (GANs) to design a novel deep learning architecture for the task of facial and head reenactment. Our method is different to purely 3D model-based approaches, or re… ▽ More Facial video re-targeting is a challenging problem aiming to modify the facial attributes of a target subject in a seamless manner by a driving monocular sequence. We leverage the 3D geometry of faces and Generative Adversarial Networks (GANs) to design a novel deep learning architecture for the task of facial and head reenactment. Our method is different to purely 3D model-based approaches, or recent image-based methods that use Deep Convolutional Neural Networks (DCNNs) to generate individual frames. We manage to capture the complex non-rigid facial motion from the driving monocular performances and synthesise temporally consistent videos, with the aid of a sequential Generator and an ad-hoc Dynamics Discriminator network. We conduct a comprehensive set of quantitative and qualitative tests and demonstrate experimentally that our proposed method can successfully transfer facial expressions, head pose and eye gaze from a source video to a target subject, in a photo-realistic and faithful fashion, better than other state-of-the-art methods. Most importantly, our system performs end-to-end reenactment in nearly real-time speed (18 fps). △ Less

Submitted 28 September, 2021; v1 submitted 17 June, 2020; originally announced June 2020.

Comments: Published in IEEE Transactions on Biometrics, Behavior, and Identity Science (Volume: 3, Issue: 1, Jan. 2021)

arXiv:2005.10954 [pdf, other]

Head2Head: Video-based Neural Head Synthesis

Authors: Mohammad Rami Koujan, Michail Christos Doukas, Anastasios Roussos, Stefanos Zafeiriou

Abstract: In this paper, we propose a novel machine learning architecture for facial reenactment. In particular, contrary to the model-based approaches or recent frame-based methods that use Deep Convolutional Neural Networks (DCNNs) to generate individual frames, we propose a novel method that (a) exploits the special structure of facial motion (paying particular attention to mouth motion) and (b) enforces… ▽ More In this paper, we propose a novel machine learning architecture for facial reenactment. In particular, contrary to the model-based approaches or recent frame-based methods that use Deep Convolutional Neural Networks (DCNNs) to generate individual frames, we propose a novel method that (a) exploits the special structure of facial motion (paying particular attention to mouth motion) and (b) enforces temporal consistency. We demonstrate that the proposed method can transfer facial expressions, pose and gaze of a source actor to a target video in a photo-realistic fashion more accurately than state-of-the-art methods. △ Less

Submitted 21 May, 2020; originally announced May 2020.

Comments: To be published in 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)

arXiv:2005.07298 [pdf, other]

DeepFaceFlow: In-the-wild Dense 3D Facial Motion Estimation

Authors: Mohammad Rami Koujan, Anastasios Roussos, Stefanos Zafeiriou

Abstract: Dense 3D facial motion capture from only monocular in-the-wild pairs of RGB images is a highly challenging problem with numerous applications, ranging from facial expression recognition to facial reenactment. In this work, we propose DeepFaceFlow, a robust, fast, and highly-accurate framework for the dense estimation of 3D non-rigid facial flow between pairs of monocular images. Our DeepFaceFlow f… ▽ More Dense 3D facial motion capture from only monocular in-the-wild pairs of RGB images is a highly challenging problem with numerous applications, ranging from facial expression recognition to facial reenactment. In this work, we propose DeepFaceFlow, a robust, fast, and highly-accurate framework for the dense estimation of 3D non-rigid facial flow between pairs of monocular images. Our DeepFaceFlow framework was trained and tested on two very large-scale facial video datasets, one of them of our own collection and annotation, with the aid of occlusion-aware and 3D-based loss function. We conduct comprehensive experiments probing different aspects of our approach and demonstrating its improved performance against state-of-the-art flow and 3D reconstruction methods. Furthermore, we incorporate our framework in a full-head state-of-the-art facial video synthesis method and demonstrate the ability of our method in better representing and capturing the facial dynamics, resulting in a highly-realistic facial video synthesis. Given registered pairs of images, our framework generates 3D flow maps at ~60 fps. △ Less

Submitted 14 May, 2020; originally announced May 2020.

Comments: to be published in the IEEE conference on Computer Vision and Pattern Recognition (CVPR). 2020

arXiv:2005.05509 [pdf, other]

Real-time Facial Expression Recognition "In The Wild'' by Disentangling 3D Expression from Identity

Authors: Mohammad Rami Koujan, Luma Alharbawee, Giorgos Giannakakis, Nicolas Pugeault, Anastasios Roussos

Abstract: Human emotions analysis has been the focus of many studies, especially in the field of Affective Computing, and is important for many applications, e.g. human-computer intelligent interaction, stress analysis, interactive games, animations, etc. Solutions for automatic emotion analysis have also benefited from the development of deep learning approaches and the availability of vast amount of visua… ▽ More Human emotions analysis has been the focus of many studies, especially in the field of Affective Computing, and is important for many applications, e.g. human-computer intelligent interaction, stress analysis, interactive games, animations, etc. Solutions for automatic emotion analysis have also benefited from the development of deep learning approaches and the availability of vast amount of visual facial data on the internet. This paper proposes a novel method for human emotion recognition from a single RGB image. We construct a large-scale dataset of facial videos (\textbf{FaceVid}), rich in facial dynamics, identities, expressions, appearance and 3D pose variations. We use this dataset to train a deep Convolutional Neural Network for estimating expression parameters of a 3D Morphable Model and combine it with an effective back-end emotion classifier. Our proposed framework runs at 50 frames per second and is capable of robustly estimating parameters of 3D expression variation and accurately recognizing facial expressions from in-the-wild images. We present extensive experimental evaluation that shows that the proposed method outperforms the compared techniques in estimating the 3D expression parameters and achieves state-of-the-art performance in recognising the basic emotions from facial images, as well as recognising stress from facial videos. %compared to the current state of the art in emotion recognition from facial images. △ Less

Submitted 11 May, 2020; originally announced May 2020.

Comments: to be published in 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)

Showing 1–14 of 14 results for author: Roussos, A