Search | arXiv e-print repository

EEG-Features for Generalized Deepfake Detection

Authors: Arian Beckmann, Tilman Stephani, Felix Klotzsche, Yonghao Chen, Simon M. Hofmann, Arno Villringer, Michael Gaebler, Vadim Nikulin, Sebastian Bosse, Peter Eisert, Anna Hilsmann

Abstract: Since the advent of Deepfakes in digital media, the development of robust and reliable detection mechanism is urgently called for. In this study, we explore a novel approach to Deepfake detection by utilizing electroencephalography (EEG) measured from the neural processing of a human participant who viewed and categorized Deepfake stimuli from the FaceForensics++ datset. These measurements serve a… ▽ More Since the advent of Deepfakes in digital media, the development of robust and reliable detection mechanism is urgently called for. In this study, we explore a novel approach to Deepfake detection by utilizing electroencephalography (EEG) measured from the neural processing of a human participant who viewed and categorized Deepfake stimuli from the FaceForensics++ datset. These measurements serve as input features to a binary support vector classifier, trained to discriminate between real and manipulated facial images. We examine whether EEG data can inform Deepfake detection and also if it can provide a generalized representation capable of identifying Deepfakes beyond the training domain. Our preliminary results indicate that human neural processing signals can be successfully integrated into Deepfake detection frameworks and hint at the potential for a generalized neural representation of artifacts in computer generated faces. Moreover, our study provides next steps towards the understanding of how digital realism is embedded in the human cognitive system, possibly enabling the development of more realistic digital avatars in the future. △ Less

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2404.10625 [pdf, other]

Gaussian Splatting Decoder for 3D-aware Generative Adversarial Networks

Authors: Florian Barthel, Arian Beckmann, Wieland Morgenstern, Anna Hilsmann, Peter Eisert

Abstract: NeRF-based 3D-aware Generative Adversarial Networks (GANs) like EG3D or GIRAFFE have shown very high rendering quality under large representational variety. However, rendering with Neural Radiance Fields poses challenges for 3D applications: First, the significant computational demands of NeRF rendering preclude its use on low-power devices, such as mobiles and VR/AR headsets. Second, implicit rep… ▽ More NeRF-based 3D-aware Generative Adversarial Networks (GANs) like EG3D or GIRAFFE have shown very high rendering quality under large representational variety. However, rendering with Neural Radiance Fields poses challenges for 3D applications: First, the significant computational demands of NeRF rendering preclude its use on low-power devices, such as mobiles and VR/AR headsets. Second, implicit representations based on neural networks are difficult to incorporate into explicit 3D scenes, such as VR environments or video games. 3D Gaussian Splatting (3DGS) overcomes these limitations by providing an explicit 3D representation that can be rendered efficiently at high frame rates. In this work, we present a novel approach that combines the high rendering quality of NeRF-based 3D-aware GANs with the flexibility and computational advantages of 3DGS. By training a decoder that maps implicit NeRF representations to explicit 3D Gaussian Splatting attributes, we can integrate the representational diversity and quality of 3D GANs into the ecosystem of 3D Gaussian Splatting for the first time. Additionally, our approach allows for a high resolution GAN inversion and real-time GAN editing with 3D Gaussian Splatting scenes. Project page: florian-barthel.github.io/gaussian_decoder △ Less

Submitted 17 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

Comments: CVPRW

arXiv:2404.10527 [pdf, other]

SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments

Authors: Niklas Gard, Anna Hilsmann, Peter Eisert

Abstract: In this paper, we present SPVLoc, a global indoor localization method that accurately determines the six-dimensional (6D) camera pose of a query image and requires minimal scene-specific prior knowledge and no scene-specific training. Our approach employs a novel matching procedure to localize the perspective camera's viewport, given as an RGB image, within a set of panoramic semantic layout repre… ▽ More In this paper, we present SPVLoc, a global indoor localization method that accurately determines the six-dimensional (6D) camera pose of a query image and requires minimal scene-specific prior knowledge and no scene-specific training. Our approach employs a novel matching procedure to localize the perspective camera's viewport, given as an RGB image, within a set of panoramic semantic layout representations of the indoor environment. The panoramas are rendered from an untextured 3D reference model, which only comprises approximate structural information about room shapes, along with door and window annotations. We demonstrate that a straightforward convolutional network structure can successfully achieve image-to-panorama and ultimately image-to-model matching. Through a viewport classification score, we rank reference panoramas and select the best match for the query image. Then, a 6D relative pose is estimated between the chosen panorama and query image. Our experiments demonstrate that this approach not only efficiently bridges the domain gap but also generalizes well to previously unseen scenes that are not part of the training data. Moreover, it achieves superior localization accuracy compared to the state of the art methods and also estimates more degrees of freedom of the camera pose. We will make our source code publicly available at https://github.com/fraunhoferhhi/spvloc . △ Less

Submitted 16 April, 2024; originally announced April 2024.

Comments: This submission includes the paper and supplementary material. 24 pages, 11 figures

arXiv:2403.12050 [pdf, other]

Efficient and Accurate Hyperspectral Image Demosaicing with Neural Network Architectures

Authors: Eric L. Wisotzky, Lara Wallburg, Anna Hilsmann, Peter Eisert, Thomas Wittenberg, Stephan Göb

Abstract: Neural network architectures for image demosaicing have been become more and more complex. This results in long training periods of such deep networks and the size of the networks is huge. These two factors prevent practical implementation and usage of the networks in real-time platforms, which generally only have limited resources. This study investigates the effectiveness of neural network archi… ▽ More Neural network architectures for image demosaicing have been become more and more complex. This results in long training periods of such deep networks and the size of the networks is huge. These two factors prevent practical implementation and usage of the networks in real-time platforms, which generally only have limited resources. This study investigates the effectiveness of neural network architectures in hyperspectral image demosaicing. We introduce a range of network models and modifications, and compare them with classical interpolation methods and existing reference network approaches. The aim is to identify robust and efficient performing network architectures. Our evaluation is conducted on two datasets, "SimpleData" and "SimRealData," representing different degrees of realism in multispectral filter array (MSFA) data. The results indicate that our networks outperform or match reference models in both datasets demonstrating exceptional performance. Notably, our approach focuses on achieving correct spectral reconstruction rather than just visual appeal, and this emphasis is supported by quantitative and qualitative assessments. Furthermore, our findings suggest that efficient demosaicing solutions, which require fewer parameters, are essential for practical applications. This research contributes valuable insights into hyperspectral imaging and its potential applications in various fields, including medical imaging. △ Less

Submitted 21 December, 2023; originally announced March 2024.

Comments: 10 pages, accepted at Visapp 2024

arXiv:2403.04380 [pdf, other]

doi 10.2312/vmv.20231237

Video-Driven Animation of Neural Head Avatars

Authors: Wolfgang Paier, Paul Hinzer, Anna Hilsmann, Peter Eisert

Abstract: We present a new approach for video-driven animation of high-quality neural 3D head models, addressing the challenge of person-independent animation from video input. Typically, high-quality generative models are learned for specific individuals from multi-view video footage, resulting in person-specific latent representations that drive the generation process. In order to achieve person-independe… ▽ More We present a new approach for video-driven animation of high-quality neural 3D head models, addressing the challenge of person-independent animation from video input. Typically, high-quality generative models are learned for specific individuals from multi-view video footage, resulting in person-specific latent representations that drive the generation process. In order to achieve person-independent animation from video input, we introduce an LSTM-based animation network capable of translating person-independent expression features into personalized animation parameters of person-specific 3D head models. Our approach combines the advantages of personalized head models (high quality and realism) with the convenience of video-driven animation employing multi-person facial performance capture. We demonstrate the effectiveness of our approach on synthesized animations with high quality based on different source videos as well as an ablation study. △ Less

Submitted 7 March, 2024; originally announced March 2024.

Journal ref: Proc. International Workshop on Vision, Modeling, and Visualization (VMV), Braunschweig, Germany, Sep. 2023

arXiv:2402.10061 [pdf, other]

doi 10.1109/CVPRW59228.2023.00418

X-maps: Direct Depth Lookup for Event-based Structured Light Systems

Authors: Wieland Morgenstern, Niklas Gard, Simon Baumann, Anna Hilsmann, Peter Eisert

Abstract: We present a new approach to direct depth estimation for Spatial Augmented Reality (SAR) applications using event cameras. These dynamic vision sensors are a great fit to be paired with laser projectors for depth estimation in a structured light approach. Our key contributions involve a conversion of the projector time map into a rectified X-map, capturing x-axis correspondences for incoming event… ▽ More We present a new approach to direct depth estimation for Spatial Augmented Reality (SAR) applications using event cameras. These dynamic vision sensors are a great fit to be paired with laser projectors for depth estimation in a structured light approach. Our key contributions involve a conversion of the projector time map into a rectified X-map, capturing x-axis correspondences for incoming events and enabling direct disparity lookup without any additional search. Compared to previous implementations, this significantly simplifies depth estimation, making it more efficient, while the accuracy is similar to the time map-based process. Moreover, we compensate non-linear temporal behavior of cheap laser projectors by a simple time map calibration, resulting in improved performance and increased depth estimation accuracy. Since depth estimation is executed by two lookups only, it can be executed almost instantly (less than 3 ms per frame with a Python implementation) for incoming events. This allows for real-time interactivity and responsiveness, which makes our approach especially suitable for SAR experiences where low latency, high frame rates and direct feedback are crucial. We present valuable insights gained into data transformed into X-maps and evaluate our depth from disparity estimation against the state of the art time map-based results. Additional results and code are available on our project page: https://fraunhoferhhi.github.io/X-maps/ △ Less

Submitted 15 February, 2024; originally announced February 2024.

Comments: Accepted at the CVPR 2023 Workshop on Event-based Vision: https://tub-rip.github.io/eventvision2023/

Journal ref: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 2023, pp. 4007-4015

arXiv:2401.09428 [pdf, other]

Multispectral Stereo-Image Fusion for 3D Hyperspectral Scene Reconstruction

Authors: Eric L. Wisotzky, Jost Triller, Anna Hilsmann, Peter Eisert

Abstract: Spectral imaging enables the analysis of optical material properties that are invisible to the human eye. Different spectral capturing setups, e.g., based on filter-wheel, push-broom, line-scanning, or mosaic cameras, have been introduced in the last years to support a wide range of applications in agriculture, medicine, and industrial surveillance. However, these systems often suffer from differe… ▽ More Spectral imaging enables the analysis of optical material properties that are invisible to the human eye. Different spectral capturing setups, e.g., based on filter-wheel, push-broom, line-scanning, or mosaic cameras, have been introduced in the last years to support a wide range of applications in agriculture, medicine, and industrial surveillance. However, these systems often suffer from different disadvantages, such as lack of real-time capability, limited spectral coverage or low spatial resolution. To address these drawbacks, we present a novel approach combining two calibrated multispectral real-time capable snapshot cameras, covering different spectral ranges, into a stereo-system. Therefore, a hyperspectral data-cube can be continuously captured. The combined use of different multispectral snapshot cameras enables both 3D reconstruction and spectral analysis. Both captured images are demosaicked avoiding spatial resolution loss. We fuse the spectral data from one camera into the other to receive a spatially and spectrally high resolution video stream. Experiments demonstrate the feasibility of this approach and the system is investigated with regard to its applicability for surgical assistance monitoring. △ Less

Submitted 15 December, 2023; originally announced January 2024.

Comments: VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications

arXiv:2312.13299 [pdf, other]

Compact 3D Scene Representation via Self-Organizing Gaussian Grids

Authors: Wieland Morgenstern, Florian Barthel, Anna Hilsmann, Peter Eisert

Abstract: 3D Gaussian Splatting has recently emerged as a highly promising technique for modeling of static 3D scenes. In contrast to Neural Radiance Fields, it utilizes efficient rasterization allowing for very fast rendering at high-quality. However, the storage size is significantly higher, which hinders practical deployment, e.g. on resource constrained devices. In this paper, we introduce a compact sce… ▽ More 3D Gaussian Splatting has recently emerged as a highly promising technique for modeling of static 3D scenes. In contrast to Neural Radiance Fields, it utilizes efficient rasterization allowing for very fast rendering at high-quality. However, the storage size is significantly higher, which hinders practical deployment, e.g. on resource constrained devices. In this paper, we introduce a compact scene representation organizing the parameters of 3D Gaussian Splatting (3DGS) into a 2D grid with local homogeneity, ensuring a drastic reduction in storage requirements without compromising visual quality during rendering. Central to our idea is the explicit exploitation of perceptual redundancies present in natural scenes. In essence, the inherent nature of a scene allows for numerous permutations of Gaussian parameters to equivalently represent it. To this end, we propose a novel highly parallel algorithm that regularly arranges the high-dimensional Gaussian parameters into a 2D grid while preserving their neighborhood structure. During training, we further enforce local smoothness between the sorted parameters in the grid. The uncompressed Gaussians use the same structure as 3DGS, ensuring a seamless integration with established renderers. Our method achieves a reduction factor of 17x to 42x in size for complex scenes with no increase in training time, marking a substantial leap forward in the domain of 3D scene distribution and consumption. Additional information can be found on our project page: https://fraunhoferhhi.github.io/Self-Organizing-Gaussians/ △ Less

Submitted 2 May, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

Comments: Added compression of spherical harmonics, updated compression method with improved results (all attributes compressed with JPEG XL now), added qualitative comparison of additional scenes, moved compression explanation and comparison to main paper, added comparison with "Making Gaussian Splats smaller"

arXiv:2312.08111 [pdf, other]

Towards Better Morphed Face Images without Ghosting Artifacts

Authors: Clemens Seibold, Anna Hilsmann, Peter Eisert

Abstract: Automatic generation of morphed face images often produces ghosting artifacts due to poorly aligned structures in the input images. Manual processing can mitigate these artifacts. However, this is not feasible for the generation of large datasets, which are required for training and evaluating robust morphing attack detectors. In this paper, we propose a method for automatic prevention of ghosting… ▽ More Automatic generation of morphed face images often produces ghosting artifacts due to poorly aligned structures in the input images. Manual processing can mitigate these artifacts. However, this is not feasible for the generation of large datasets, which are required for training and evaluating robust morphing attack detectors. In this paper, we propose a method for automatic prevention of ghosting artifacts based on a pixel-wise alignment during morph generation. We evaluate our proposed method on state-of-the-art detectors and show that our morphs are harder to detect, particularly, when combined with style-transfer-based improvement of low-level image characteristics. Furthermore, we show that our approach does not impair the biometric quality, which is essential for high quality morphs. △ Less

Submitted 13 December, 2023; originally announced December 2023.

Comments: Accepted at VISAPP 2024

arXiv:2312.05330 [pdf, ps, other]

Multi-view Inversion for 3D-aware Generative Adversarial Networks

Authors: Florian Barthel, Anna Hilsmann, Peter Eisert

Abstract: Current 3D GAN inversion methods for human heads typically use only one single frontal image to reconstruct the whole 3D head model. This leaves out meaningful information when multi-view data or dynamic videos are available. Our method builds on existing state-of-the-art 3D GAN inversion techniques to allow for consistent and simultaneous inversion of multiple views of the same subject. We employ… ▽ More Current 3D GAN inversion methods for human heads typically use only one single frontal image to reconstruct the whole 3D head model. This leaves out meaningful information when multi-view data or dynamic videos are available. Our method builds on existing state-of-the-art 3D GAN inversion techniques to allow for consistent and simultaneous inversion of multiple views of the same subject. We employ a multi-latent extension to handle inconsistencies present in dynamic face videos to re-synthesize consistent 3D representations from the sequence. As our method uses additional information about the target subject, we observe significant enhancements in both geometric accuracy and image quality, particularly when rendering from wide viewing angles. Moreover, we demonstrate the editability of our inverted 3D renderings, which distinguishes them from NeRF-based scene reconstructions. △ Less

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2311.14981 [pdf, other]

Multi-task Planar Reconstruction with Feature War** Guidance

Authors: Luan Wei, Anna Hilsmann, Peter Eisert

Abstract: Piece-wise planar 3D reconstruction simultaneously segments plane instances and recovers their 3D plane parameters from an image, which is particularly useful for indoor or man-made environments. Efficient reconstruction of 3D planes coupled with semantic predictions offers advantages for a wide range of applications requiring scene understanding and concurrent spatial map**. However, most exist… ▽ More Piece-wise planar 3D reconstruction simultaneously segments plane instances and recovers their 3D plane parameters from an image, which is particularly useful for indoor or man-made environments. Efficient reconstruction of 3D planes coupled with semantic predictions offers advantages for a wide range of applications requiring scene understanding and concurrent spatial map**. However, most existing planar reconstruction models either neglect semantic predictions or do not run efficiently enough for real-time applications. We introduce SOLOPlanes, a real-time planar reconstruction model based on a modified instance segmentation architecture which simultaneously predicts semantics for each plane instance, along with plane parameters and piece-wise plane instance masks. We achieve an improvement in instance mask segmentation by including multi-view guidance for plane predictions in the training process. This cross-task improvement, training for plane prediction but improving the mask segmentation, is due to the nature of feature sharing in multi-task learning. Our model simultaneously predicts semantics using single images at inference time, while achieving real-time predictions at 43 FPS. △ Less

Submitted 21 December, 2023; v1 submitted 25 November, 2023; originally announced November 2023.

Comments: For code, see https://github.com/fraunhoferhhi/SOLOPlanes

Journal ref: VISAPP 2024

arXiv:2311.03140 [pdf, other]

doi 10.5220/0012373800003660

Animating NeRFs from Texture Space: A Framework for Pose-Dependent Rendering of Human Performances

Authors: Paul Knoll, Wieland Morgenstern, Anna Hilsmann, Peter Eisert

Abstract: Creating high-quality controllable 3D human models from multi-view RGB videos poses a significant challenge. Neural radiance fields (NeRFs) have demonstrated remarkable quality in reconstructing and free-viewpoint rendering of static as well as dynamic scenes. The extension to a controllable synthesis of dynamic human performances poses an exciting research question. In this paper, we introduce a… ▽ More Creating high-quality controllable 3D human models from multi-view RGB videos poses a significant challenge. Neural radiance fields (NeRFs) have demonstrated remarkable quality in reconstructing and free-viewpoint rendering of static as well as dynamic scenes. The extension to a controllable synthesis of dynamic human performances poses an exciting research question. In this paper, we introduce a novel NeRF-based framework for pose-dependent rendering of human performances. In our approach, the radiance field is warped around an SMPL body mesh, thereby creating a new surface-aligned representation. Our representation can be animated through skeletal joint parameters that are provided to the NeRF in addition to the viewpoint for pose dependent appearances. To achieve this, our representation includes the corresponding 2D UV coordinates on the mesh texture map and the distance between the query point and the mesh. To enable efficient learning despite map** ambiguities and random visual variations, we introduce a novel remap** process that refines the mapped coordinates. Experiments demonstrate that our approach results in high-quality renderings for novel-view and novel-pose synthesis. △ Less

Submitted 6 November, 2023; originally announced November 2023.

Journal ref: Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP; ISBN 978-989-758-679-8; ISSN 2184-4321, SciTePress, pages 404-413, 2024

arXiv:2310.07534 [pdf, other]

doi 10.1109/ICDMW60847.2023.00122

Human-Centered Evaluation of XAI Methods

Authors: Karam Dawoud, Wojciech Samek, Peter Eisert, Sebastian Lapuschkin, Sebastian Bosse

Abstract: In the ever-evolving field of Artificial Intelligence, a critical challenge has been to decipher the decision-making processes within the so-called "black boxes" in deep learning. Over recent years, a plethora of methods have emerged, dedicated to explaining decisions across diverse tasks. Particularly in tasks like image classification, these methods typically identify and emphasize the pivotal p… ▽ More In the ever-evolving field of Artificial Intelligence, a critical challenge has been to decipher the decision-making processes within the so-called "black boxes" in deep learning. Over recent years, a plethora of methods have emerged, dedicated to explaining decisions across diverse tasks. Particularly in tasks like image classification, these methods typically identify and emphasize the pivotal pixels that most influence a classifier's prediction. Interestingly, this approach mirrors human behavior: when asked to explain our rationale for classifying an image, we often point to the most salient features or aspects. Capitalizing on this parallel, our research embarked on a user-centric study. We sought to objectively measure the interpretability of three leading explanation methods: (1) Prototypical Part Network, (2) Occlusion, and (3) Layer-wise Relevance Propagation. Intriguingly, our results highlight that while the regions spotlighted by these methods can vary widely, they all offer humans a nearly equivalent depth of understanding. This enables users to discern and categorize images efficiently, reinforcing the value of these methods in enhancing AI transparency. △ Less

Submitted 16 October, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

Journal ref: ICDMW (2023) 912-921

arXiv:2310.03615 [pdf, other]

doi 10.1109/TVCG.2024.3372117

Animatable Virtual Humans: Learning pose-dependent human representations in UV space for interactive performance synthesis

Authors: Wieland Morgenstern, Milena T. Bagdasarian, Anna Hilsmann, Peter Eisert

Abstract: We propose a novel representation of virtual humans for highly realistic real-time animation and rendering in 3D applications. We learn pose dependent appearance and geometry from highly accurate dynamic mesh sequences obtained from state-of-the-art multiview-video reconstruction. Learning pose-dependent appearance and geometry from mesh sequences poses significant challenges, as it requires the n… ▽ More We propose a novel representation of virtual humans for highly realistic real-time animation and rendering in 3D applications. We learn pose dependent appearance and geometry from highly accurate dynamic mesh sequences obtained from state-of-the-art multiview-video reconstruction. Learning pose-dependent appearance and geometry from mesh sequences poses significant challenges, as it requires the network to learn the intricate shape and articulated motion of a human body. However, statistical body models like SMPL provide valuable a-priori knowledge which we leverage in order to constrain the dimension of the search space enabling more efficient and targeted learning and define pose-dependency. Instead of directly learning absolute pose-dependent geometry, we learn the difference between the observed geometry and the fitted SMPL model. This allows us to encode both pose-dependent appearance and geometry in the consistent UV space of the SMPL model. This approach not only ensures a high level of realism but also facilitates streamlined processing and rendering of virtual humans in real-time scenarios. △ Less

Submitted 5 October, 2023; originally announced October 2023.

Journal ref: 2024 TVCG Special Issue on the 2024 IEEE Conference on Virtual Reality and 3D User Interfaces (VR)

arXiv:2308.16819 [pdf, other]

BTSeg: Barlow Twins Regularization for Domain Adaptation in Semantic Segmentation

Authors: Johannes Künzel, Anna Hilsmann, Peter Eisert

Abstract: Semantic image segmentation is particularly vital for the advancement of autonomous vehicle technologies. However, this domain faces substantial challenges under adverse conditions like rain or darkness, which remain underrepresented in most datasets. The generation of additional training data for these scenarios is not only costly but also fraught with potential inaccuracies, largely attributable… ▽ More Semantic image segmentation is particularly vital for the advancement of autonomous vehicle technologies. However, this domain faces substantial challenges under adverse conditions like rain or darkness, which remain underrepresented in most datasets. The generation of additional training data for these scenarios is not only costly but also fraught with potential inaccuracies, largely attributable to the aleatoric uncertainty inherent in such conditions. We introduce BTSeg, an innovative, semi-supervised training approach enhancing semantic segmentation models in order to effectively handle a range of adverse conditions without requiring the creation of extensive new datasets. BTSeg employs a novel application of the Barlow Twins loss, a concept borrowed from unsupervised learning. The original Barlow Twins approach uses stochastic augmentations in order to learn useful representations from unlabeled data without the need for external labels. In our approach, we regard images captured at identical locations but under varying adverse conditions as manifold representation of the same scene (which could be interpreted as "natural augmentations"), thereby enabling the model to conceptualize its understanding of the environment. We evaluate our approach on the ACDC dataset, where it performs favorably when compared to the current state-of-the-art methods, while also being simpler to implement and train. For the new challenging ACG benchmark it shows cutting-edge performance, demonstrating its robustness and generalization capabilities. We will make the code publicly available post-acceptance. △ Less

Submitted 20 November, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

Comments: 10 pages, slightly improved results, memory-efficient training on full-resolution

arXiv:2306.14361 [pdf, other]

A differentiable Gaussian Prototype Layer for explainable Segmentation

Authors: Michael Gerstenberger, Steffen Maaß, Peter Eisert, Sebastian Bosse

Abstract: We introduce a Gaussian Prototype Layer for gradient-based prototype learning and demonstrate two novel network architectures for explainable segmentation one of which relies on region proposals. Both models are evaluated on agricultural datasets. While Gaussian Mixture Models (GMMs) have been used to model latent distributions of neural networks before, they are typically fitted using the EM algo… ▽ More We introduce a Gaussian Prototype Layer for gradient-based prototype learning and demonstrate two novel network architectures for explainable segmentation one of which relies on region proposals. Both models are evaluated on agricultural datasets. While Gaussian Mixture Models (GMMs) have been used to model latent distributions of neural networks before, they are typically fitted using the EM algorithm. Instead, the proposed prototype layer relies on gradient-based optimization and hence allows for end-to-end training. This facilitates development and allows to use the full potential of a trainable deep feature extractor. We show that it can be used as a novel building block for explainable neural networks. We employ our Gaussian Prototype Layer in (1) a model where prototypes are detected in the latent grid and (2) a model inspired by Fast-RCNN with SLIC superpixels as region proposals. The earlier achieves a similar performance as compared to the state-of-the art while the latter has the benefit of a more precise prototype localization that comes at the cost of slightly lower accuracies. By introducing a gradient-based GMM layer we combine the benefits of end-to-end training with the simplicity and theoretical foundation of GMMs which will allow to adapt existing semi-supervised learning strategies for prototypical part models in future. △ Less

Submitted 25 June, 2023; originally announced June 2023.

arXiv:2306.10006 [pdf, other]

Unsupervised Learning of Style-Aware Facial Animation from Real Acting Performances

Authors: Wolfgang Paier, Anna Hilsmann, Peter Eisert

Abstract: This paper presents a novel approach for text/speech-driven animation of a photo-realistic head model based on blend-shape geometry, dynamic textures, and neural rendering. Training a VAE for geometry and texture yields a parametric model for accurate capturing and realistic synthesis of facial expressions from a latent feature vector. Our animation method is based on a conditional CNN that transf… ▽ More This paper presents a novel approach for text/speech-driven animation of a photo-realistic head model based on blend-shape geometry, dynamic textures, and neural rendering. Training a VAE for geometry and texture yields a parametric model for accurate capturing and realistic synthesis of facial expressions from a latent feature vector. Our animation method is based on a conditional CNN that transforms text or speech into a sequence of animation parameters. In contrast to previous approaches, our animation model learns disentangling/synthesizing different acting-styles in an unsupervised manner, requiring only phonetic labels that describe the content of training sequences. For realistic real-time rendering, we train a U-Net that refines rasterization-based renderings by computing improved pixel colors and a foreground matte. We compare our framework qualitatively/quantitatively against recent methods for head modeling as well as facial animation and evaluate the perceived rendering/animation quality in a user-study, which indicates large improvements compared to state-of-the-art approaches △ Less

Submitted 1 September, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

Comments: 16 pages, submitted to Graphical Models (Feb 2023)

arXiv:2306.01642 [pdf, other]

doi 10.23919/MVA57639.2023.10215746

Automatic Reconstruction of Semantic 3D Models from 2D Floor Plans

Authors: Aleixo Cambeiro Barreiro, Mariusz Trzeciakiewicz, Anna Hilsmann, Peter Eisert

Abstract: Digitalization of existing buildings and the creation of 3D BIM models for them has become crucial for many tasks. Of particular importance are floor plans, which contain information about building layouts and are vital for processes such as construction, maintenance or refurbishing. However, this data is not always available in digital form, especially for older buildings constructed before CAD t… ▽ More Digitalization of existing buildings and the creation of 3D BIM models for them has become crucial for many tasks. Of particular importance are floor plans, which contain information about building layouts and are vital for processes such as construction, maintenance or refurbishing. However, this data is not always available in digital form, especially for older buildings constructed before CAD tools were widely available, or lacks semantic information. The digitalization of such information usually requires manual work of an expert that must reconstruct the layouts by hand, which is a cumbersome and error-prone process. In this paper, we present a pipeline for reconstruction of vectorized 3D models from scanned 2D plans, aiming at increasing the efficiency of this process. The method presented achieves state-of-the-art results in the public dataset CubiCasa5k, and shows good generalization to different types of plans. Our vectorization approach is particularly effective, outperforming previous methods. △ Less

Submitted 2 June, 2023; originally announced June 2023.

Comments: 5 pages, 1 figure

arXiv:2305.06483 [pdf, other]

Towards L-System Captioning for Tree Reconstruction

Authors: Jannes S. Magnusson, Anna Hilsmann, Peter Eisert

Abstract: This work proposes a novel concept for tree and plant reconstruction by directly inferring a Lindenmayer-System (L-System) word representation from image data in an image captioning approach. We train a model end-to-end which is able to translate given images into L-System words as a description of the displayed tree. To prove this concept, we demonstrate the applicability on 2D tree topologies. T… ▽ More This work proposes a novel concept for tree and plant reconstruction by directly inferring a Lindenmayer-System (L-System) word representation from image data in an image captioning approach. We train a model end-to-end which is able to translate given images into L-System words as a description of the displayed tree. To prove this concept, we demonstrate the applicability on 2D tree topologies. Transferred to real image data, this novel idea could lead to more efficient, accurate and semantically meaningful tree and plant reconstruction without using error-prone point cloud extraction, and other processes usually utilized in tree reconstruction. Furthermore, this approach bypasses the need for a predefined L-System grammar and enables species-specific L-System inference without biological knowledge. △ Less

Submitted 10 May, 2023; originally announced May 2023.

Comments: Eurographics 2023

ACM Class: I.4.5

arXiv:2305.05282 [pdf, other]

Fooling State-of-the-Art Deepfake Detection with High-Quality Deepfakes

Authors: Arian Beckmann, Anna Hilsmann, Peter Eisert

Abstract: Due to the rising threat of deepfakes to security and privacy, it is most important to develop robust and reliable detectors. In this paper, we examine the need for high-quality samples in the training datasets of such detectors. Accordingly, we show that deepfake detectors proven to generalize well on multiple research datasets still struggle in real-world scenarios with well-crafted fakes. First… ▽ More Due to the rising threat of deepfakes to security and privacy, it is most important to develop robust and reliable detectors. In this paper, we examine the need for high-quality samples in the training datasets of such detectors. Accordingly, we show that deepfake detectors proven to generalize well on multiple research datasets still struggle in real-world scenarios with well-crafted fakes. First, we propose a novel autoencoder for face swap** alongside an advanced face blending technique, which we utilize to generate 90 high-quality deepfakes. Second, we feed those fakes to a state-of-the-art detector, causing its performance to decrease drastically. Moreover, we fine-tune the detector on our fakes and demonstrate that they contain useful clues for the detection of manipulations. Overall, our results provide insights into the generalization of deepfake detectors and suggest that their training datasets should be complemented by high-quality fakes since training on mere research data is insufficient. △ Less

Submitted 16 May, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: Accepted at IH&MMSec '23

arXiv:2303.02978 [pdf, other]

doi 10.5220/0011779900003417

System for 3D Acquisition and 3D Reconstruction using Structured Light for Sewer Line Inspection

Authors: Johannes Künzel, Darko Vehar, Rico Nestler, Karl-Heinz Franke, Anna Hilsmann, Peter Eisert

Abstract: The assessment of sewer pipe systems is a highly important, but at the same time cumbersome and error-prone task. We introduce an innovative system based on single-shot structured light modules that facilitates the detection and classification of spatial defects like jutting intrusions, spallings, or misaligned joints. This system creates highly accurate 3D measurements with sub-millimeter resolut… ▽ More The assessment of sewer pipe systems is a highly important, but at the same time cumbersome and error-prone task. We introduce an innovative system based on single-shot structured light modules that facilitates the detection and classification of spatial defects like jutting intrusions, spallings, or misaligned joints. This system creates highly accurate 3D measurements with sub-millimeter resolution of pipe surfaces and fuses them into a holistic 3D model. The benefit of such a holistic 3D model is twofold: on the one hand, it facilitates the accurate manual sewer pipe assessment, on the other, it simplifies the detection of defects in downstream automatic systems as it endows the input with highly accurate depth information. In this work, we provide an extensive overview of the system and give valuable insights into our design choices. △ Less

Submitted 6 March, 2023; originally announced March 2023.

Comments: 10 pages, published at VISAPP 2023, Lisbon, Portugal

Journal ref: Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, 997-1006, 2023 , Lisbon, Portugal

arXiv:2303.00050 [pdf, other]

Dynamic Multi-View Scene Reconstruction Using Neural Implicit Surface

Authors: Decai Chen, Haofei Lu, Ingo Feldmann, Oliver Schreer, Peter Eisert

Abstract: Reconstructing general dynamic scenes is important for many computer vision and graphics applications. Recent works represent the dynamic scene with neural radiance fields for photorealistic view synthesis, while their surface geometry is under-constrained and noisy. Other works introduce surface constraints to the implicit neural representation to disentangle the ambiguity of geometry and appeara… ▽ More Reconstructing general dynamic scenes is important for many computer vision and graphics applications. Recent works represent the dynamic scene with neural radiance fields for photorealistic view synthesis, while their surface geometry is under-constrained and noisy. Other works introduce surface constraints to the implicit neural representation to disentangle the ambiguity of geometry and appearance field for static scene reconstruction. To bridge the gap between rendering dynamic scenes and recovering static surface geometry, we propose a template-free method to reconstruct surface geometry and appearance using neural implicit representations from multi-view videos. We leverage topology-aware deformation and the signed distance field to learn complex dynamic surfaces via differentiable volume rendering without scene-specific prior knowledge like template models. Furthermore, we propose a novel mask-based ray selection strategy to significantly boost the optimization on challenging time-varying regions. Experiments on different multi-view video datasets demonstrate that our method achieves high-fidelity surface reconstruction as well as photorealistic novel view synthesis. △ Less

Submitted 28 February, 2023; originally announced March 2023.

Comments: 5 pages, accepted by ICASSP 2023

arXiv:2212.04386 [pdf, other]

Multi-View Mesh Reconstruction with Neural Deferred Shading

Authors: Markus Worchel, Rodrigo Diaz, Weiwen Hu, Oliver Schreer, Ingo Feldmann, Peter Eisert

Abstract: We propose an analysis-by-synthesis method for fast multi-view 3D reconstruction of opaque objects with arbitrary materials and illumination. State-of-the-art methods use both neural surface representations and neural rendering. While flexible, neural surface representations are a significant bottleneck in optimization runtime. Instead, we represent surfaces as triangle meshes and build a differen… ▽ More We propose an analysis-by-synthesis method for fast multi-view 3D reconstruction of opaque objects with arbitrary materials and illumination. State-of-the-art methods use both neural surface representations and neural rendering. While flexible, neural surface representations are a significant bottleneck in optimization runtime. Instead, we represent surfaces as triangle meshes and build a differentiable rendering pipeline around triangle rasterization and neural shading. The renderer is used in a gradient descent optimization where both a triangle mesh and a neural shader are jointly optimized to reproduce the multi-view images. We evaluate our method on a public 3D reconstruction dataset and show that it can match the reconstruction accuracy of traditional baselines and neural approaches while surpassing them in optimization runtime. Additionally, we investigate the shader and find that it learns an interpretable representation of appearance, enabling applications such as 3D material editing. △ Less

Submitted 8 December, 2022; originally announced December 2022.

Comments: CVPR 2022, project page: https://fraunhoferhhi.github.io/neural-deferred-shading/

arXiv:2211.15435 [pdf, other]

doi 10.1007/978-3-031-16788-1_13

Hyperspectral Demosaicing of Snapshot Camera Images Using Deep Learning

Authors: Eric L. Wisotzky, Charul Daudkhane, Anna Hilsmann, Peter Eisert

Abstract: Spectral imaging technologies have rapidly evolved during the past decades. The recent development of single-camera-one-shot techniques for hyperspectral imaging allows multiple spectral bands to be captured simultaneously (3x3, 4x4 or 5x5 mosaic), opening up a wide range of applications. Examples include intraoperative imaging, agricultural field inspection and food quality assessment. To capture… ▽ More Spectral imaging technologies have rapidly evolved during the past decades. The recent development of single-camera-one-shot techniques for hyperspectral imaging allows multiple spectral bands to be captured simultaneously (3x3, 4x4 or 5x5 mosaic), opening up a wide range of applications. Examples include intraoperative imaging, agricultural field inspection and food quality assessment. To capture images across a wide spectrum range, i.e. to achieve high spectral resolution, the sensor design sacrifices spatial resolution. With increasing mosaic size, this effect becomes increasingly detrimental. Furthermore, demosaicing is challenging. Without incorporating edge, shape, and object information during interpolation, chromatic artifacts are likely to appear in the obtained images. Recent approaches use neural networks for demosaicing, enabling direct information extraction from image data. However, obtaining training data for these approaches poses a challenge as well. This work proposes a parallel neural network based demosaicing procedure trained on a new ground truth dataset captured in a controlled environment by a hyperspectral snapshot camera with a 4x4 mosaic pattern. The dataset is a combination of real captured scenes with images from publicly available data adapted to the 4x4 mosaic pattern. To obtain real world ground-truth data, we performed multiple camera captures with 1-pixel shifts in order to compose the entire data cube. Experiments show that the proposed network outperforms state-of-art networks. △ Less

Submitted 21 November, 2022; originally announced November 2022.

Comments: German Conference on Pattern Recognition (GCPR) 2022

Journal ref: In DAGM German Conference on Pattern Recognition (pp. 198-212). Springer, Cham (2022)

arXiv:2211.11320 [pdf, other]

Recovering Fine Details for Neural Implicit Surface Reconstruction

Authors: Decai Chen, Peng Zhang, Ingo Feldmann, Oliver Schreer, Peter Eisert

Abstract: Recent works on implicit neural representations have made significant strides. Learning implicit neural surfaces using volume rendering has gained popularity in multi-view reconstruction without 3D supervision. However, accurately recovering fine details is still challenging, due to the underlying ambiguity of geometry and appearance representation. In this paper, we present D-NeuS, a volume rende… ▽ More Recent works on implicit neural representations have made significant strides. Learning implicit neural surfaces using volume rendering has gained popularity in multi-view reconstruction without 3D supervision. However, accurately recovering fine details is still challenging, due to the underlying ambiguity of geometry and appearance representation. In this paper, we present D-NeuS, a volume rendering-base neural implicit surface reconstruction method capable to recover fine geometry details, which extends NeuS by two additional loss functions targeting enhanced reconstruction quality. First, we encourage the rendered surface points from alpha compositing to have zero signed distance values, alleviating the geometry bias arising from transforming SDF to density for volume rendering. Second, we impose multi-view feature consistency on the surface points, derived by interpolating SDF zero-crossings from sampled points along rays. Extensive quantitative and qualitative results demonstrate that our method reconstructs high-accuracy surfaces with details, and outperforms the state of the art. △ Less

Submitted 21 November, 2022; originally announced November 2022.

arXiv:2210.05318 [pdf, other]

CASAPose: Class-Adaptive and Semantic-Aware Multi-Object Pose Estimation

Authors: Niklas Gard, Anna Hilsmann, Peter Eisert

Abstract: Applications in the field of augmented reality or robotics often require joint localisation and 6D pose estimation of multiple objects. However, most algorithms need one network per object class to be trained in order to provide the best results. Analysing all visible objects demands multiple inferences, which is memory and time-consuming. We present a new single-stage architecture called CASAPose… ▽ More Applications in the field of augmented reality or robotics often require joint localisation and 6D pose estimation of multiple objects. However, most algorithms need one network per object class to be trained in order to provide the best results. Analysing all visible objects demands multiple inferences, which is memory and time-consuming. We present a new single-stage architecture called CASAPose that determines 2D-3D correspondences for pose estimation of multiple different objects in RGB images in one pass. It is fast and memory efficient, and achieves high accuracy for multiple objects by exploiting the output of a semantic segmentation decoder as control input to a keypoint recognition decoder via local class-adaptive normalisation. Our new differentiable regression of keypoint locations significantly contributes to a faster closing of the domain gap between real test and synthetic training data. We apply segmentation-aware convolutions and upsampling operations to increase the focus inside the object mask and to reduce mutual interference of occluding objects. For each inserted object, the network grows by only one output segmentation map and a negligible number of parameters. We outperform state-of-the-art approaches in challenging multi-object scenes with inter-object occlusion and synthetic training. △ Less

Submitted 9 December, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

Comments: BMVC 2022, camera-ready version (this submission includes the paper and supplementary material)

arXiv:2208.13840 [pdf, other]

doi 10.1109/CVPRW56347.2022.00238

Perfusion assessment via local remote photoplethysmography (rPPG)

Authors: Benjamin Kossack, Eric Wisotzky, Peter Eisert, Sebastian P. Schraven, Brigitta Globke, Anna Hilsmann

Abstract: This paper presents an approach to assess the perfusion of visible human tissue from RGB video files. We propose metrics derived from remote photoplethysmography (rPPG) signals to detect whether a tissue is adequately supplied with blood. The perfusion analysis is done in three different scales, offering a flexible approach for different applications. We perform a plane-orthogonal-to-skin rPPG ind… ▽ More This paper presents an approach to assess the perfusion of visible human tissue from RGB video files. We propose metrics derived from remote photoplethysmography (rPPG) signals to detect whether a tissue is adequately supplied with blood. The perfusion analysis is done in three different scales, offering a flexible approach for different applications. We perform a plane-orthogonal-to-skin rPPG independently for locally defined regions of interest on each scale. From the extracted signals, we derive the signal-to-noise ratio, magnitude in the frequency domain, heart rate, perfusion index as well as correlation between specific rPPG signals in order to locally assess the perfusion of a specific region of human tissue. We show that locally resolved rPPG has a broad range of applications. As exemplary applications, we present results in intraoperative perfusion analysis and visualization during skin and organ transplantation as well as an application for liveliness assessment for the detection of presentation attacks to authentication systems. △ Less

Submitted 29 August, 2022; originally announced August 2022.

Comments: 10 pages, 6 figures, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops

Journal ref: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

arXiv:2203.10087 [pdf, other]

But that's not why: Inference adjustment by interactive prototype revision

Authors: Michael Gerstenberger, Sebastian Lapuschkin, Peter Eisert, Sebastian Bosse

Abstract: Despite significant advances in machine learning, decision-making of artificial agents is still not perfect and often requires post-hoc human interventions. If the prediction of a model relies on unreasonable factors it is desirable to remove their effect. Deep interactive prototype adjustment enables the user to give hints and correct the model's reasoning. In this paper, we demonstrate that prot… ▽ More Despite significant advances in machine learning, decision-making of artificial agents is still not perfect and often requires post-hoc human interventions. If the prediction of a model relies on unreasonable factors it is desirable to remove their effect. Deep interactive prototype adjustment enables the user to give hints and correct the model's reasoning. In this paper, we demonstrate that prototypical-part models are well suited for this task as their prediction is based on prototypical image patches that can be interpreted semantically by the user. It shows that even correct classifications can rely on unreasonable prototypes that result from confounding variables in a dataset. Hence, we propose simple yet effective interaction schemes for inference adjustment: The user is consulted interactively to identify faulty prototypes. Non-object prototypes can be removed by prototype masking or a custom mode of deselection training. Interactive prototype rejection allows machine learning naïve users to adjust the logic of reasoning without compromising the accuracy. △ Less

Submitted 9 October, 2023; v1 submitted 18 March, 2022; originally announced March 2022.

arXiv:2202.13118 [pdf, other]

doi 10.1109/IC3D53758.2021.9687256

Accurate Human Body Reconstruction for Volumetric Video

Authors: Decai Chen, Markus Worchel, Ingo Feldmann, Oliver Schreer, Peter Eisert

Abstract: In this work, we enhance a professional end-to-end volumetric video production pipeline to achieve high-fidelity human body reconstruction using only passive cameras. While current volumetric video approaches estimate depth maps using traditional stereo matching techniques, we introduce and optimize deep learning-based multi-view stereo networks for depth map estimation in the context of professio… ▽ More In this work, we enhance a professional end-to-end volumetric video production pipeline to achieve high-fidelity human body reconstruction using only passive cameras. While current volumetric video approaches estimate depth maps using traditional stereo matching techniques, we introduce and optimize deep learning-based multi-view stereo networks for depth map estimation in the context of professional volumetric video reconstruction. Furthermore, we propose a novel depth map post-processing approach including filtering and fusion, by taking into account photometric confidence, cross-view geometric consistency, foreground masks as well as camera viewing frustums. We show that our method can generate high levels of geometric detail for reconstructed human bodies. △ Less

Submitted 26 February, 2022; originally announced February 2022.

Comments: 2021 International Conference on 3D Immersion (IC3D)

arXiv:2202.03074 [pdf, other]

Imposing Temporal Consistency on Deep Monocular Body Shape and Pose Estimation

Authors: Alexandra Zimmer, Anna Hilsmann, Wieland Morgenstern, Peter Eisert

Abstract: Accurate and temporally consistent modeling of human bodies is essential for a wide range of applications, including character animation, understanding human social behavior and AR/VR interfaces. Capturing human motion accurately from a monocular image sequence is still challenging and the modeling quality is strongly influenced by the temporal consistency of the captured body motion. Our work pre… ▽ More Accurate and temporally consistent modeling of human bodies is essential for a wide range of applications, including character animation, understanding human social behavior and AR/VR interfaces. Capturing human motion accurately from a monocular image sequence is still challenging and the modeling quality is strongly influenced by the temporal consistency of the captured body motion. Our work presents an elegant solution for the integration of temporal constraints in the fitting process. This does not only increase temporal consistency but also robustness during the optimization. In detail, we derive parameters of a sequence of body models, representing shape and motion of a person, including jaw poses, facial expressions, and finger poses. We optimize these parameters over the complete image sequence, fitting one consistent body shape while imposing temporal consistency on the body motion, assuming linear body joint trajectories over a short time. Our approach enables the derivation of realistic 3D body models from image sequences, including facial expression and articulated hands. In extensive experiments, we show that our approach results in accurately estimated body shape and motion, also for challenging movements and poses. Further, we apply it to the special application of sign language analysis, where accurate and temporal consistent motion modelling is essential, and show that the approach is well-suited for this kind of application. △ Less

Submitted 8 February, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

arXiv:2202.00315 [pdf, other]

doi 10.5220/0010893600003124

From Explanations to Segmentation: Using Explainable AI for Image Segmentation

Authors: Clemens Seibold, Johannes Künzel, Anna Hilsmann, Peter Eisert

Abstract: The new era of image segmentation leveraging the power of Deep Neural Nets (DNNs) comes with a price tag: to train a neural network for pixel-wise segmentation, a large amount of training samples has to be manually labeled on pixel-precision. In this work, we address this by following an indirect solution. We build upon the advances of the Explainable AI (XAI) community and extract a pixel-wise bi… ▽ More The new era of image segmentation leveraging the power of Deep Neural Nets (DNNs) comes with a price tag: to train a neural network for pixel-wise segmentation, a large amount of training samples has to be manually labeled on pixel-precision. In this work, we address this by following an indirect solution. We build upon the advances of the Explainable AI (XAI) community and extract a pixel-wise binary segmentation from the output of the Layer-wise Relevance Propagation (LRP) explaining the decision of a classification network. We show that we achieve similar results compared to an established U-Net segmentation architecture, while the generation of the training data is significantly simplified. The proposed method can be trained in a weakly supervised fashion, as the training samples must be only labeled on image-level, at the same time enabling the output of a segmentation mask. This makes it especially applicable to a wider range of real applications where tedious pixel-level labelling is often not possible. △ Less

Submitted 1 February, 2022; originally announced February 2022.

Comments: to be published in: 17th International Conference on Computer Vision Theory and Applications (VISAPP), February 2022

arXiv:2201.13278 [pdf, other]

doi 10.5220/0010882700003124

Combining Local and Global Pose Estimation for Precise Tracking of Similar Objects

Authors: Niklas Gard, Anna Hilsmann, Peter Eisert

Abstract: In this paper, we present a multi-object 6D detection and tracking pipeline for potentially similar and non-textured objects. The combination of a convolutional neural network for object classification and rough pose estimation with a local pose refinement and an automatic mismatch detection enables direct application in real-time AR scenarios. A new network architecture, trained solely with synth… ▽ More In this paper, we present a multi-object 6D detection and tracking pipeline for potentially similar and non-textured objects. The combination of a convolutional neural network for object classification and rough pose estimation with a local pose refinement and an automatic mismatch detection enables direct application in real-time AR scenarios. A new network architecture, trained solely with synthetic images, allows simultaneous pose estimation of multiple objects with reduced GPU memory consumption and enhanced performance. In addition, the pose estimates are further improved by a local edge-based refinement step that explicitly exploits known object geometry information. For continuous movements, the sole use of local refinement reduces pose mismatches due to geometric ambiguities or occlusions. We showcase the entire tracking pipeline and demonstrate the benefits of the combined approach. Experiments on a challenging set of non-textured similar objects demonstrate the enhanced quality compared to the baseline method. Finally, we illustrate how the system can be used in a real AR assistance application within the field of construction. △ Less

Submitted 31 January, 2022; originally announced January 2022.

Comments: Accepted at VISAPP 2022

arXiv:2111.15581 [pdf, other]

doi 10.5220/0010826500003124

Automated Damage Inspection of Power Transmission Towers from UAV Images

Authors: Aleixo Cambeiro Barreiro, Clemens Seibold, Anna Hilsmann, Peter Eisert

Abstract: Infrastructure inspection is a very costly task, requiring technicians to access remote or hard-to-reach places. This is the case for power transmission towers, which are sparsely located and require trained workers to climb them to search for damages. Recently, the use of drones or helicopters for remote recording is increasing in the industry, sparing the technicians this perilous task. This, ho… ▽ More Infrastructure inspection is a very costly task, requiring technicians to access remote or hard-to-reach places. This is the case for power transmission towers, which are sparsely located and require trained workers to climb them to search for damages. Recently, the use of drones or helicopters for remote recording is increasing in the industry, sparing the technicians this perilous task. This, however, leaves the problem of analyzing big amounts of images, which has great potential for automation. This is a challenging task for several reasons. First, the lack of freely available training data and the difficulty to collect it complicate this problem. Additionally, the boundaries of what constitutes a damage are fuzzy, introducing a degree of subjectivity in the labelling of the data. The unbalanced class distribution in the images also plays a role in increasing the difficulty of the task. This paper tackles the problem of structural damage detection in transmission towers, addressing these issues. Our main contributions are the development of a system for damage detection on remotely acquired drone images, applying techniques to overcome the issue of data scarcity and ambiguity, as well as the evaluation of the viability of such an approach to solve this particular problem. △ Less

Submitted 30 November, 2021; originally announced November 2021.

Comments: 8 pages, 10 figures, accepted for VISAPP 2022

arXiv:2108.04091 [pdf, other]

doi 10.1109/ICIP42928.2021.9506436

Zero in on Shape: A Generic 2D-3D Instance Similarity Metric learned from Synthetic Data

Authors: Maciej Janik, Niklas Gard, Anna Hilsmann, Peter Eisert

Abstract: We present a network architecture which compares RGB images and untextured 3D models by the similarity of the represented shape. Our system is optimised for zero-shot retrieval, meaning it can recognise shapes never shown in training. We use a view-based shape descriptor and a siamese network to learn object geometry from pairs of 3D models and 2D images. Due to scarcity of datasets with exact pho… ▽ More We present a network architecture which compares RGB images and untextured 3D models by the similarity of the represented shape. Our system is optimised for zero-shot retrieval, meaning it can recognise shapes never shown in training. We use a view-based shape descriptor and a siamese network to learn object geometry from pairs of 3D models and 2D images. Due to scarcity of datasets with exact photograph-mesh correspondences, we train our network with only synthetic data. Our experiments investigate the effect of different qualities and quantities of training data on retrieval accuracy and present insights from bridging the domain gap. We show that increasing the variety of synthetic data improves retrieval accuracy and that our system's performance in zero-shot mode can match that of the instance-aware mode, as far as narrowing down the search to the top 10% of objects. △ Less

Submitted 9 August, 2021; originally announced August 2021.

Comments: Accepted to ICIP 2021

arXiv:2103.14697 [pdf, other]

Focused LRP: Explainable AI for Face Morphing Attack Detection

Authors: Clemens Seibold, Anna Hilsmann, Peter Eisert

Abstract: The task of detecting morphed face images has become highly relevant in recent years to ensure the security of automatic verification systems based on facial images, e.g. automated border control gates. Detection methods based on Deep Neural Networks (DNN) have been shown to be very suitable to this end. However, they do not provide transparency in the decision making and it is not clear how they… ▽ More The task of detecting morphed face images has become highly relevant in recent years to ensure the security of automatic verification systems based on facial images, e.g. automated border control gates. Detection methods based on Deep Neural Networks (DNN) have been shown to be very suitable to this end. However, they do not provide transparency in the decision making and it is not clear how they distinguish between genuine and morphed face images. This is particularly relevant for systems intended to assist a human operator, who should be able to understand the reasoning. In this paper, we tackle this problem and present Focused Layer-wise Relevance Propagation (FLRP). This framework explains to a human inspector on a precise pixel level, which image regions are used by a Deep Neural Network to distinguish between a genuine and a morphed face image. Additionally, we propose another framework to objectively analyze the quality of our method and compare FLRP to other DNN interpretability methods. This evaluation framework is based on removing detected artifacts and analyzing the influence of these changes on the decision of the DNN. Especially, if the DNN is uncertain in its decision or even incorrect, FLRP performs much better in highlighting visible artifacts compared to other methods. △ Less

Submitted 26 March, 2021; originally announced March 2021.

Comments: Published at WACVW 2021

arXiv:2101.01133 [pdf, other]

Stereo Correspondence and Reconstruction of Endoscopic Data Challenge

Authors: Max Allan, Jonathan Mcleod, Congcong Wang, Jean Claude Rosenthal, Zhenglei Hu, Niklas Gard, Peter Eisert, Ke Xue Fu, Trevor Zeffiro, Wenyao Xia, Zhanshi Zhu, Huoling Luo, Fucang Jia, Xiran Zhang, Xiaohong Li, Lalith Sharan, Tom Kurmann, Sebastian Schmid, Raphael Sznitman, Dimitris Psychogyios, Mahdi Azizian, Danail Stoyanov, Lena Maier-Hein, Stefanie Speidel

Abstract: The stereo correspondence and reconstruction of endoscopic data sub-challenge was organized during the Endovis challenge at MICCAI 2019 in Shenzhen, China. The task was to perform dense depth estimation using 7 training datasets and 2 test sets of structured light data captured using porcine cadavers. These were provided by a team at Intuitive Surgical. 10 teams participated in the challenge day.… ▽ More The stereo correspondence and reconstruction of endoscopic data sub-challenge was organized during the Endovis challenge at MICCAI 2019 in Shenzhen, China. The task was to perform dense depth estimation using 7 training datasets and 2 test sets of structured light data captured using porcine cadavers. These were provided by a team at Intuitive Surgical. 10 teams participated in the challenge day. This paper contains 3 additional methods which were submitted after the challenge finished as well as a supplemental section from these teams on issues they found with the dataset. △ Less

Submitted 28 January, 2021; v1 submitted 4 January, 2021; originally announced January 2021.

arXiv:2009.10361 [pdf, other]

Neural Face Models for Example-Based Visual Speech Synthesis

Authors: Wolfgang Paier, Anna Hilsmann, Peter Eisert

Abstract: Creating realistic animations of human faces with computer graphic models is still a challenging task. It is often solved either with tedious manual work or motion capture based techniques that require specialised and costly hardware. Example based animation approaches circumvent these problems by re-using captured data of real people. This data is split into short motion samples that can be loope… ▽ More Creating realistic animations of human faces with computer graphic models is still a challenging task. It is often solved either with tedious manual work or motion capture based techniques that require specialised and costly hardware. Example based animation approaches circumvent these problems by re-using captured data of real people. This data is split into short motion samples that can be looped or concatenated in order to create novel motion sequences. The obvious advantages of this approach are the simplicity of use and the high realism, since the data exhibits only real deformations. Rather than tuning weights of a complex face rig, the animation task is performed on a higher level by arranging typical motion samples in a way such that the desired facial performance is achieved. Two difficulties with example based approaches, however, are high memory requirements as well as the creation of artefact-free and realistic transitions between motion samples. We solve these problems by combining the realism and simplicity of example-based animations with the advantages of neural face models. Our neural face model is capable of synthesising high quality 3D face geometry and texture according to a compact latent parameter vector. This latent representation reduces memory requirements by a factor of 100 and helps creating seamless transitions between concatenated motion samples. In this paper, we present a marker-less approach for facial motion capture based on multi-view video. Based on the captured data, we learn a neural representation of facial expressions, which is used to seamlessly concatenate facial performances during the animation procedure. We demonstrate the effectiveness of our approach by synthesising mouthings for Swiss-German sign language based on viseme query sequences. △ Less

Submitted 22 September, 2020; originally announced September 2020.

ACM Class: I.3.5; I.4.8

arXiv:2009.00922 [pdf, other]

doi 10.1049/iet-cvi.2019.0786

Going beyond Free Viewpoint: Creating Animatable Volumetric Video of Human Performances

Authors: Anna Hilsmann, Philipp Fechteler, Wieland Morgenstern, Wolfgang Paier, Ingo Feldmann, Oliver Schreer, Peter Eisert

Abstract: In this paper, we present an end-to-end pipeline for the creation of high-quality animatable volumetric video content of human performances. Going beyond the application of free-viewpoint volumetric video, we allow re-animation and alteration of an actor's performance through (i) the enrichment of the captured data with semantics and animation properties and (ii) applying hybrid geometry- and vide… ▽ More In this paper, we present an end-to-end pipeline for the creation of high-quality animatable volumetric video content of human performances. Going beyond the application of free-viewpoint volumetric video, we allow re-animation and alteration of an actor's performance through (i) the enrichment of the captured data with semantics and animation properties and (ii) applying hybrid geometry- and video-based animation methods that allow a direct animation of the high-quality data itself instead of creating an animatable model that resembles the captured data. Semantic enrichment and geometric animation ability are achieved by establishing temporal consistency in the 3D data, followed by an automatic rigging of each frame using a parametric shape-adaptive full human body model. Our hybrid geometry- and video-based animation approaches combine the flexibility of classical CG animation with the realism of real captured data. For pose editing, we exploit the captured data as much as possible and kinematically deform the captured frames to fit a desired pose. Further, we treat the face differently from the body in a hybrid geometry- and video-based animation approach where coarse movements and poses are modeled in the geometry only, while very fine and subtle details in the face, often lacking in purely geometric methods, are captured in video-based textures. These are processed to be interactively combined to form new facial expressions. On top of that, we learn the appearance of regions that are challenging to synthesize, such as the teeth or the eyes, and fill in missing regions realistically in an autoencoder-based approach. This paper covers the full pipeline from capturing and producing high-quality video content, over the enrichment with semantics and deformation properties for re-animation and processing of the data for the final hybrid animation. △ Less

Submitted 2 September, 2020; originally announced September 2020.

Journal ref: ET Computer Vision, Special Issue on Computer Vision for the Creative Industries (2020)

arXiv:2004.11435 [pdf, other]

Style Your Face Morph and Improve Your Face Morphing Attack Detector

Authors: Clemens Seibold, Anna Hilsmann, Peter Eisert

Abstract: A morphed face image is a synthetically created image that looks so similar to the faces of two subjects that both can use it for verification against a biometric verification system. It can be easily created by aligning and blending face images of the two subjects. In this paper, we propose a style transfer based method that improves the quality of morphed face images. It counters the image degen… ▽ More A morphed face image is a synthetically created image that looks so similar to the faces of two subjects that both can use it for verification against a biometric verification system. It can be easily created by aligning and blending face images of the two subjects. In this paper, we propose a style transfer based method that improves the quality of morphed face images. It counters the image degeneration during the creation of morphed face images caused by blending. We analyze different state of the art face morphing attack detection systems regarding their performance against our improved morphed face images and other methods that improve the image quality. All detection systems perform significantly worse, when first confronted with our improved morphed face images. Most of them can be enhanced by adding our quality improved morphs to the training data, which further improves the robustness against other means of quality improvement. △ Less

Submitted 23 April, 2020; originally announced April 2020.

Comments: Published at BIOSIG 2019

arXiv:1912.05222 [pdf, other]

doi 10.1109/WACV.2018.00223

Automatic Analysis of Sewer Pipes Based on Unrolled Monocular Fisheye Images

Authors: Johannes Künzel, Thomas Werner, Ronja Möller, Peter Eisert, Jan Waschnewski, Ralf Hilpert

Abstract: The task of detecting and classifying damages in sewer pipes offers an important application area for computer vision algorithms. This paper describes a system, which is capable of accomplishing this task solely based on low quality and severely compressed fisheye images from a pipe inspection robot. Relying on robust image features, we estimate camera poses, model the image lighting, and exploit… ▽ More The task of detecting and classifying damages in sewer pipes offers an important application area for computer vision algorithms. This paper describes a system, which is capable of accomplishing this task solely based on low quality and severely compressed fisheye images from a pipe inspection robot. Relying on robust image features, we estimate camera poses, model the image lighting, and exploit this information to generate high quality cylindrical unwraps of the pipes' surfaces.Based on the generated images, we apply semantic labeling based on deep convolutional neural networks to detect and classify defects as well as structural elements. △ Less

Submitted 11 December, 2019; originally announced December 2019.

Comments: Published in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV)

Journal ref: J. Kunzel, T. Werner, P. Eisert and J. Waschnewski, "Automatic Analysis of Sewer Pipes Based on Unrolled Monocular Fisheye Images," 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, 2018, pp. 2019-2027

arXiv:1807.02030 [pdf, other]

Reflection Analysis for Face Morphing Attack Detection

Authors: Clemens Seibold, Anna Hilsmann, Peter Eisert

Abstract: A facial morph is a synthetically created image of a face that looks similar to two different individuals and can even trick biometric facial recognition systems into recognizing both individuals. This attack is known as face morphing attack. The process of creating such a facial morph is well documented and a lot of tutorials and software to create them are freely available. Therefore, it is mand… ▽ More A facial morph is a synthetically created image of a face that looks similar to two different individuals and can even trick biometric facial recognition systems into recognizing both individuals. This attack is known as face morphing attack. The process of creating such a facial morph is well documented and a lot of tutorials and software to create them are freely available. Therefore, it is mandatory to be able to detect this kind of fraud to ensure the integrity of the face as reliable biometric feature. In this work, we study the effects of face morphing on the physically correctness of the illumination. We estimate the direction to the light sources based on specular highlights in the eyes and use them to generate a synthetic map for highlights on the skin. This map is compared with the highlights in the image that is suspected to be a fraud. Morphing faces with different geometries, a bad alignment of the source images or using images with different illuminations, can lead to inconsistencies in reflections that indicate the existence of a morphing attack. △ Less

Submitted 5 July, 2018; originally announced July 2018.

Comments: 5 pages, 6 figures, accepted at EUSIPCO 2018

arXiv:1806.04265 [pdf, other]

Accurate and Robust Neural Networks for Security Related Applications Exampled by Face Morphing Attacks

Authors: Clemens Seibold, Wojciech Samek, Anna Hilsmann, Peter Eisert

Abstract: Artificial neural networks tend to learn only what they need for a task. A manipulation of the training data can counter this phenomenon. In this paper, we study the effect of different alterations of the training data, which limit the amount and position of information that is available for the decision making. We analyze the accuracy and robustness against semantic and black box attacks on the n… ▽ More Artificial neural networks tend to learn only what they need for a task. A manipulation of the training data can counter this phenomenon. In this paper, we study the effect of different alterations of the training data, which limit the amount and position of information that is available for the decision making. We analyze the accuracy and robustness against semantic and black box attacks on the networks that were trained on different training data modifications for the particular example of morphing attacks. A morphing attack is an attack on a biometric facial recognition system where the system is fooled to match two different individuals with the same synthetic face image. Such a synthetic image can be created by aligning and blending images of the two individuals that should be matched with this image. △ Less

Submitted 11 June, 2018; originally announced June 2018.

Comments: 16 pages, 7 figures

Showing 1–42 of 42 results for author: Eisert, P