Search | arXiv e-print repository

ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D image

Authors: Marco Pesavento, Yuanlu Xu, Nikolaos Sarafianos, Robert Maier, Ziyan Wang, Chun-Han Yao, Marco Volino, Edmond Boyer, Adrian Hilton, Tony Tung

Abstract: Recent progress in human shape learning shows that neural implicit models are effective in generating 3D human surfaces from a limited number of views, and even from a single RGB image. However, existing monocular approaches still struggle to recover fine geometric details such as face, hands or cloth wrinkles. They are also easily prone to depth ambiguities that result in distorted geometries along the camera optical axis. In this paper, we explore the benefits of incorporating depth observations in the reconstruction process by introducing ANIM, a novel method that reconstructs arbitrary 3D human shapes from single-view RGB-D images with an unprecedented level of accuracy. Our model learns geometric details from both multi-resolution pixel-aligned and voxel-aligned features to leverage depth information and enable spatial relationships, mitigating depth ambiguities. We further enhance the quality of the reconstructed shape by introducing a depth-supervision strategy, which improves the accuracy of the signed distance field estimation of points that lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms state-of-the-art works that use RGB, surface normals, point cloud or RGB-D data as input. In addition, we introduce ANIM-Real, a new multi-modal dataset comprising high-quality scans paired with consumer-grade RGB-D camera, and our protocol to fine-tune ANIM, enabling high-quality reconstruction from real-world human capture.

Submitted 18 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR24; Project page: https://marcopesavento.github.io/ANIM/

arXiv:2401.08360 [pdf, other]

AdaSem: Adaptive Goal-Oriented Semantic Communications for End-to-End Camera Relocalization

Authors: Qi Liao, Tze-Yang Tung

Abstract: Recently, deep autoencoders have gained traction as a powerful method for implementing goal-oriented semantic communications systems. The idea is to train a mapping from the source domain directly to channel symbols, and vice versa. However, prior studies often focused on rate-distortion tradeoff and transmission delay, at the cost of increasing end-to-end complexity and thus latency. Moreover, the datasets used are often not reflective of real-world environments, and the results were not validated against real-world baseline systems, leading to an unfair comparison. In this paper, we study the problem of remote camera pose estimation and propose AdaSem, an adaptive semantic communications approach that optimizes the tradeoff between inference accuracy and end-to-end latency. We develop an adaptive semantic codec model, which encodes the source data into a dynamic number of symbols, based on the latent space distribution and the channel state feedback. We utilize a lightweight model for both transmitter and receiver to ensure comparable complexity to the baseline implemented in a real-world system. Extensive experiments on real-environment data show the effectiveness of our approach. When compared to a real implementation of a client-server camera relocalization service, AdaSem outperforms the baseline by reducing the end-to-end delay and estimation error by over 75% and 63%, respectively.

Submitted 24 May, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: IEEE INFOCOM 2024

arXiv:2401.03754 [pdf, other]

Joint Power Allocation and User Scheduling in Integrated Satellite-Terrestrial Cell-Free Massive MIMO IoT Systems

Authors: Trinh Van Chien, Ha An Le, Ta Hai Tung, Hien Quoc Ngo, Symeon Chatzinotas

Abstract: Both space and ground communications have been proven effective solutions under different perspectives in Internet of Things (IoT) networks. This paper investigates multiple-access scenarios, where plenty of IoT users are cooperatively served by a satellite in space and access points (APs) on the ground. Available users in each coherence interval are split into scheduled and unscheduled subsets to optimize limited radio resources. We compute the uplink ergodic throughput of each scheduled user under imperfect channel state information (CSI) and non-orthogonal pilot signals. As maximum-ratio combining is deployed locally at the ground gateway and the APs, the uplink ergodic throughput is obtained in a closed-form expression. The analytical results explicitly unveil the effects of channel conditions and pilot contamination on each scheduled user. By maximizing the sum throughput, the system can simultaneously determine scheduled users and perform power allocation based on either a model-based approach with alternating optimization or a learning-based approach with the graph neural network. Numerical results manifest that integrated satellite-terrestrial cell-free massive multiple-input multiple-output systems can significantly improve the sum ergodic throughput over coherence intervals. The integrated systems can schedule the vast majority of users; some might be out of service due to the limited power budget.

Submitted 8 January, 2024; originally announced January 2024.

Comments: 15 pages, 10 figures, 1 table. Submitted for publication

arXiv:2312.17192 [pdf, other]

HISR: Hybrid Implicit Surface Representation for Photorealistic 3D Human Reconstruction

Authors: Angtian Wang, Yuanlu Xu, Nikolaos Sarafianos, Robert Maier, Edmond Boyer, Alan Yuille, Tony Tung

Abstract: Neural reconstruction and rendering strategies have demonstrated state-of-the-art performances due, in part, to their ability to preserve high level shape details. Existing approaches, however, either represent objects as implicit surface functions or neural volumes and still struggle to recover shapes with heterogeneous materials, in particular human skin, hair or clothes. To this aim, we present a new hybrid implicit surface representation to model human shapes. This representation is composed of two surface layers that represent opaque and translucent regions on the clothed human body. We segment different regions automatically using visual cues and learn to reconstruct two signed distance functions (SDFs). We perform surface-based rendering on opaque regions (e.g., body, face, clothes) to preserve high-fidelity surface normals and volume rendering on translucent regions (e.g., hair). Experiments demonstrate that our approach obtains state-of-the-art results on 3D human reconstructions, and also shows competitive performances on other objects.

Submitted 28 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI 2024 main track

arXiv:2308.14847 [pdf, other]

NSF: Neural Surface Fields for Human Modeling from Monocular Depth

Authors: Yuxuan Xue, Bharat Lal Bhatnagar, Riccardo Marin, Nikolaos Sarafianos, Yuanlu Xu, Gerard Pons-Moll, Tony Tung

Abstract: Obtaining personalized 3D animatable avatars from a monocular camera has several real world applications in gaming, virtual try-on, animation, and VR/XR, etc. However, it is very challenging to model dynamic and fine-grained clothing deformations from such sparse data. Existing methods for modeling 3D humans from depth data have limitations in terms of computational efficiency, mesh coherency, and flexibility in resolution and topology. For instance, reconstructing shapes using implicit functions and extracting explicit meshes per frame is computationally expensive and cannot ensure coherent meshes across frames. Moreover, predicting per-vertex deformations on a pre-designed human template with a discrete surface lacks flexibility in resolution and topology. To overcome these limitations, we propose a novel method Neural Surface Fields for modeling 3D clothed humans from monocular depth. NSF defines a neural field solely on the base surface which models a continuous and flexible displacement field. NSF can be adapted to the base surface with different resolution and topology without retraining at inference time. Compared to existing approaches, our method eliminates the expensive per-frame surface extraction while maintaining mesh coherency, and is capable of reconstructing meshes with arbitrary resolution without retraining. To foster research in this direction, we release our code in project page at: https://yuxuan-xue.com/nsf.

Submitted 27 October, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

Comments: Accepted to ICCV 2023; Homepage at: https://yuxuan-xue.com/nsf

arXiv:2304.00020 [pdf, ps, other]

SemiMemes: A Semi-supervised Learning Approach for Multimodal Memes Analysis

Authors: Pham Thai Hoang Tung, Nguyen Tan Viet, Ngo Tien Anh, Phan Duy Hung

Abstract: The prevalence of memes on social media has created the need to sentiment analyze their underlying meanings for censoring harmful content. Meme censoring systems by machine learning raise the need for a semi-supervised learning solution to take advantage of the large number of unlabeled memes available on the internet and make the annotation process less challenging. Moreover, the approach needs to utilize multimodal data as memes' meanings usually come from both images and texts. This research proposes a multimodal semi-supervised learning approach that outperforms other multimodal semi-supervised learning and supervised learning state-of-the-art models on two datasets, the Multimedia Automatic Misogyny Identification and Hateful Memes dataset. Building on the insights gained from Contrastive Language-Image Pre-training, which is an effective multimodal learning technique, this research introduces SemiMemes, a novel training method that combines auto-encoder and classification task to make use of the resourceful unlabeled data.

Submitted 16 May, 2023; v1 submitted 31 March, 2023; originally announced April 2023.

Comments: ICCCI 2023

arXiv:2303.15893 [pdf, other]

VIVE3D: Viewpoint-Independent Video Editing using 3D-Aware GANs

Authors: Anna Frühstück, Nikolaos Sarafianos, Yuanlu Xu, Peter Wonka, Tony Tung

Abstract: We introduce VIVE3D, a novel approach that extends the capabilities of image-based 3D GANs to video editing and is able to represent the input video in an identity-preserving and temporally consistent way. We propose two new building blocks. First, we introduce a novel GAN inversion technique specifically tailored to 3D GANs by jointly embedding multiple frames and optimizing for the camera parameters. Second, besides traditional semantic face edits (e.g. for age and expression), we are the first to demonstrate edits that show novel views of the head enabled by the inherent properties of 3D GANs and our optical flow-guided compositing technique to combine the head with the background video. Our experiments demonstrate that VIVE3D generates high-fidelity face edits at consistent quality from a range of camera viewpoints which are composited with the original video in a temporally and spatially consistent manner.

Submitted 28 March, 2023; originally announced March 2023.

Comments: CVPR 2023. Project webpage and video available at http://afruehstueck.github.io/vive3D

arXiv:2211.13772 [pdf, other]

Generative Joint Source-Channel Coding for Semantic Image Transmission

Authors: Ecenaz Erdemir, Tze-Yang Tung, Pier Luigi Dragotti, Deniz Gunduz

Abstract: Recent works have shown that joint source-channel coding (JSCC) schemes using deep neural networks (DNNs), called DeepJSCC, provide promising results in wireless image transmission. However, these methods mostly focus on the distortion of the reconstructed signals with respect to the input image, rather than their perception by humans. However, focusing on traditional distortion metrics alone does not necessarily result in high perceptual quality, especially in extreme physical conditions, such as very low bandwidth compression ratio (BCR) and low signal-to-noise ratio (SNR) regimes. In this work, we propose two novel JSCC schemes that leverage the perceptual quality of deep generative models (DGMs) for wireless image transmission, namely InverseJSCC and GenerativeJSCC. While the former is an inverse problem approach to DeepJSCC, the latter is an end-to-end optimized JSCC scheme. In both, we optimize a weighted sum of mean squared error (MSE) and learned perceptual image patch similarity (LPIPS) losses, which capture more semantic similarities than other distortion metrics. InverseJSCC performs denoising on the distorted reconstructions of a DeepJSCC model by solving an inverse optimization problem using style-based generative adversarial network (StyleGAN). Our simulation results show that InverseJSCC significantly improves the state-of-the-art (SotA) DeepJSCC in terms of perceptual quality in edge cases. In GenerativeJSCC, we carry out end-to-end training of an encoder and a StyleGAN-based decoder, and show that GenerativeJSCC significantly outperforms DeepJSCC both in terms of distortion and perceptual quality.

Submitted 24 November, 2022; originally announced November 2022.

Comments: 12 pages, 9 figures

arXiv:2211.08747 [pdf, other]

Deep Joint Source-Channel Coding for Semantic Communications

Authors: Jialong Xu, Tze-Yang Tung, Bo Ai, Wei Chen, Yuxuan Sun, Deniz Gunduz

Abstract: Semantic communications is considered as a promising technology to increase the efficiency of next-generation communication systems, particularly targeting human-machine and machine-type communications. In contrast to the source-agnostic approach of conventional wireless communication systems, semantic communication seeks to ensure that only the relevant information for the underlying task is communicated to the receiver. Considering that most semantic communication applications have strict latency, bandwidth, and power constraints, a prominent approach is to model them as a joint source-channel coding (JSCC) problem. Although JSCC has been a long-standing open problem in communication and coding theory, remarkable performance gains have been shown recently over existing separate source and channel coding systems, particularly in low-latency and low-power scenarios. Recent progress is thanks to the adoption of deep learning techniques for joint source-channel code design that outperform the concatenation of state-of-the-art compression and channel coding schemes, which are results of decades-long research efforts. In this article, we present an adaptive deep learning based JSCC (DeepJSCC) architecture for semantic communications, introduce its design principles, highlight its benefits, and outline future research challenges that lie ahead.

Submitted 18 July, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

Comments: 7 pages, 6 figures

arXiv:2209.00082 [pdf, other]

Multi-View Reconstruction using Signed Ray Distance Functions (SRDF)

Authors: Pierre Zins, Yuanlu Xu, Edmond Boyer, Stefanie Wuhrer, Tony Tung

Abstract: In this paper, we investigate a new optimization framework for multi-view 3D shape reconstructions. Recent differentiable rendering approaches have provided breakthrough performances with implicit shape representations though they can still lack precision in the estimated geometries. On the other hand multi-view stereo methods can yield pixel wise geometric accuracy with local depth predictions along viewing rays. Our approach bridges the gap between the two strategies with a novel volumetric shape representation that is implicit but parameterized with pixel depths to better materialize the shape surface with consistent signed distances along viewing rays. The approach retains pixel-accuracy while benefiting from volumetric integration in the optimization. To this aim, depths are optimized by evaluating, at each 3D location within the volumetric discretization, the agreement between the depth prediction consistency and the photometric consistency for the corresponding pixels. The optimization is agnostic to the associated photo-consistency term which can vary from a median-based baseline to more elaborate criteria learned functions. Our experiments demonstrate the benefit of the volumetric integration with depth predictions. They also show that our approach outperforms existing approaches over standard 3D benchmarks with better geometry estimations.

Submitted 3 March, 2023; v1 submitted 31 August, 2022; originally announced September 2022.

arXiv:2208.09245 [pdf, other]

Deep Joint Source-Channel and Encryption Coding: Secure Semantic Communications

Authors: Tze-Yang Tung, Deniz Gunduz

Abstract: Deep learning driven joint source-channel coding (JSCC) for wireless image or video transmission, also called DeepJSCC, has been a topic of interest recently with very promising results. The idea is to map similar source samples to nearby points in the channel input space such that, despite the noise introduced by the channel, the input can be recovered with minimal distortion. In DeepJSCC, this is achieved by an autoencoder architecture with a non-trainable channel layer between the encoder and decoder. DeepJSCC has many favorable properties, such as better end-to-end distortion performance than its separate source and channel coding counterpart as well as graceful degradation with respect to channel quality. However, due to the inherent correlation between the source sample and channel input, DeepJSCC is vulnerable to eavesdropping attacks. In this paper, we propose the first DeepJSCC scheme for wireless image transmission that is secure against eavesdroppers, called DeepJSCEC. DeepJSCEC not only preserves the favorable properties of DeepJSCC, it also provides security against chosen-plaintext attacks from the eavesdropper, without the need to make assumptions about the eavesdropper's channel condition, or its intended use of the intercepted signal. Numerical results show that DeepJSCEC achieves similar or better image quality than separate source coding using BPG compression, AES encryption, and LDPC codes for channel coding, while preserving the graceful degradation of image quality with respect to channel quality. We also show that the proposed encryption method is problem agnostic, meaning it can be applied to other end-to-end JSCC problems, such as remote classification, without modification. Given the importance of security in modern wireless communication systems, we believe this work brings DeepJSCC schemes much closer to adoption in practice.

Submitted 31 August, 2022; v1 submitted 19 August, 2022; originally announced August 2022.

arXiv:2207.13807 [pdf, other]

Pose-NDF: Modeling Human Pose Manifolds with Neural Distance Fields

Authors: Garvita Tiwari, Dimitrije Antic, Jan Eric Lenssen, Nikolaos Sarafianos, Tony Tung, Gerard Pons-Moll

Abstract: We present Pose-NDF, a continuous model for plausible human poses based on neural distance fields (NDFs). Pose or motion priors are important for generating realistic new poses and for reconstructing accurate poses from noisy or partial observations. Pose-NDF learns a manifold of plausible poses as the zero level set of a neural implicit function, extending the idea of modeling implicit surfaces in 3D to the high-dimensional domain SO(3)^K, where a human pose is defined by a single data point, represented by K quaternions. The resulting high-dimensional implicit function can be differentiated with respect to the input poses and thus can be used to project arbitrary poses onto the manifold by using gradient descent on the set of 3-dimensional hyperspheres. In contrast to previous VAE-based human pose priors, which transform the pose space into a Gaussian distribution, we model the actual pose manifold, preserving the distances between poses. We demonstrate that Pose-NDF outperforms existing state-of-the-art methods as a prior in various downstream tasks, ranging from denoising real-world human mocap data, pose recovery from occluded data to 3D pose reconstruction from images. Furthermore, we show that it can be used to generate more diverse poses by random sampling and projection than VAE-based methods.

Submitted 27 July, 2022; originally announced July 2022.

Comments: Project page: https://virtualhumans.mpi-inf.mpg.de/posendf

Journal ref: European Conference on Computer Vision (ECCV 2022), Oral Presentation

arXiv:2206.08100 [pdf, other]

DeepJSCC-Q: Constellation Constrained Deep Joint Source-Channel Coding

Authors: Tze-Yang Tung, David Burth Kurka, Mikolaj Jankowski, Deniz Gunduz

Abstract: Recent works have shown that modern machine learning techniques can provide an alternative approach to the long-standing joint source-channel coding (JSCC) problem. Very promising initial results, superior to popular digital schemes that utilize separate source and channel codes, have been demonstrated for wireless image and video transmission using deep neural networks (DNNs). However, end-to-end training of such schemes requires a differentiable channel input representation; hence, prior works have assumed that any complex value can be transmitted over the channel. This can prevent the application of these codes in scenarios where the hardware or protocol can only admit certain sets of channel inputs, prescribed by a digital constellation. Herein, we propose DeepJSCC-Q, an end-to-end optimized JSCC solution for wireless image transmission using a finite channel input alphabet. We show that DeepJSCC-Q can achieve similar performance to prior works that allow any complex valued channel input, especially when high modulation orders are available, and that the performance asymptotically approaches that of unconstrained channel input as the modulation order increases. Importantly, DeepJSCC-Q preserves the graceful degradation of image quality in unpredictable channel conditions, a desirable property for deployment in mobile systems with rapidly changing channel conditions.

Submitted 16 June, 2022; originally announced June 2022.

Comments: arXiv admin note: text overlap with arXiv:2111.13042

arXiv:2205.09111 [pdf, other]

BodyMap: Learning Full-Body Dense Correspondence Map

Authors: Anastasia Ianina, Nikolaos Sarafianos, Yuanlu Xu, Ignacio Rocco, Tony Tung

Abstract: Dense correspondence between humans carries powerful semantic information that can be utilized to solve fundamental problems for full-body understanding such as in-the-wild surface matching, tracking and reconstruction. In this paper we present BodyMap, a new framework for obtaining high-definition full-body and continuous dense correspondence between in-the-wild images of clothed humans and the surface of a 3D template model. The correspondences cover fine details such as hands and hair, while capturing regions far from the body surface, such as loose clothing. Prior methods for estimating such dense surface correspondence i) cut a 3D body into parts which are unwrapped to a 2D UV space, producing discontinuities along part seams, or ii) use a single surface for representing the whole body, but none handled body details. Here, we introduce a novel network architecture with Vision Transformers that learn fine-level features on a continuous body surface. BodyMap outperforms prior work on various metrics and datasets, including DensePose-COCO by a large margin. Furthermore, we show various applications ranging from multi-layer dense cloth correspondence, neural rendering with novel-view synthesis and appearance swapping.

Submitted 18 May, 2022; originally announced May 2022.

Comments: CVPR 2022 Project Page: https://nsarafianos.github.io/bodymap

arXiv:2204.01218 [pdf, other]

Neural Rendering of Humans in Novel View and Pose from Monocular Video

Authors: Tiantian Wang, Nikolaos Sarafianos, Ming-Hsuan Yang, Tony Tung

Abstract: We introduce a new method that generates photo-realistic humans under novel views and poses given a monocular video as input. Despite the significant progress recently on this topic, with several methods exploring shared canonical neural radiance fields in dynamic scene scenarios, learning a user-controlled model for unseen poses remains a challenging task. To tackle this problem, we introduce an effective method to a) integrate observations across several frames and b) encode the appearance at each individual frame. We accomplish this by utilizing both the human pose that models the body shape as well as point clouds that partially cover the human as input. Our approach simultaneously learns a shared set of latent codes anchored to the human pose among several frames, and an appearance-dependent code anchored to incomplete point clouds generated by each frame and its predicted depth. The former human pose-based code models the shape of the performer whereas the latter point cloud-based code predicts fine-level details and reasons about missing structures at the unseen poses. To further recover non-visible regions in query frames, we employ a temporal transformer to integrate features of points in query frames and tracked body points from automatically-selected key frames. Experiments on various sequences of dynamic humans from different datasets including ZJU-MoCap show that our method significantly outperforms existing approaches under unseen poses and novel views given monocular videos as input.

Submitted 20 April, 2023; v1 submitted 3 April, 2022; originally announced April 2022.

Comments: 10 pages

arXiv:2201.08141 [pdf, other]

SPAMs: Structured Implicit Parametric Models

Authors: Pablo Palafox, Nikolaos Sarafianos, Tony Tung, Angela Dai

Abstract: Parametric 3D models have formed a fundamental role in modeling deformable objects, such as human bodies, faces, and hands; however, the construction of such parametric models requires significant manual intervention and domain expertise. Recently, neural implicit 3D representations have shown great expressibility in capturing 3D shape geometry. We observe that deformable object motion is often semantically structured, and thus propose to learn Structured-implicit PArametric Models (SPAMs) as a deformable object representation that structurally decomposes non-rigid object motion into part-based disentangled representations of shape and pose, with each being represented by deep implicit functions. This enables a structured characterization of object movement, with part decomposition characterizing a lower-dimensional space in which we can establish coarse motion correspondence. In particular, we can leverage the part decompositions at test time to fit to new depth sequences of unobserved shapes, by establishing part correspondences between the input observation and our learned part spaces; this guides a robust joint optimization between the shape and pose of all parts, even under dramatic motion sequences. Experiments demonstrate that our part-aware shape and pose understanding lead to state-of-the-art performance in reconstruction and tracking of depth sequences of complex deforming object motion. We plan to release models to the public at https://pablopalafox.github.io/spams.

Submitted 20 January, 2022; originally announced January 2022.

Comments: Project page: https://pablopalafox.github.io/spams/ - Video: https://youtu.be/ChdjHNGgrzI

arXiv:2112.13889

Free-Viewpoint RGB-D Human Performance Capture and Rendering

Authors: Phong Nguyen-Ha, Nikolaos Sarafianos, Christoph Lassner, Janne Heikkila, Tony Tung

Abstract: Capturing and faithfully rendering photo-realistic humans from novel views is a fundamental problem for AR/VR applications. While prior work has shown impressive performance capture results in laboratory settings, it is non-trivial to achieve casual free-viewpoint human capture and rendering for unseen identities with high fidelity, especially for facial expressions, hands, and clothes. To tackle these challenges, we introduce a novel view synthesis framework that generates realistic renders from unseen views of any human captured from a single-view and sparse RGB-D sensor, similar to a low-cost depth camera, and without actor-specific models. We propose an architecture to create dense feature maps in novel views obtained by sphere-based neural rendering, and create complete renders using a global context inpainting model. Additionally, an enhancer network leverages the overall fidelity, even in occluded areas from the original view, producing crisp renders with fine details. We show that our method generates high-quality novel views of synthetic and real human actors given a single-stream, sparse RGB-D input. It generalizes to unseen identities and new poses, and faithfully reconstructs facial expressions. Our approach outperforms prior view synthesis methods and is robust to different levels of depth sparsity.

Submitted 2 August, 2022; v1 submitted 27 December, 2021; originally announced December 2021.

Comments: Accepted at ECCV 2022, Project page: https://www.phongnhhn.info/HVS_Net/index.html

arXiv:2111.13042

DeepJSCC-Q: Channel Input Constrained Deep Joint Source-Channel Coding

Authors: Tze-Yang Tung, David Burth Kurka, Mikolaj Jankowski, Deniz Gündüz

Abstract: Recent works have shown that the task of wireless transmission of images can be learned with the use of machine learning techniques. Very promising results in end-to-end image quality, superior to popular digital schemes that utilize source and channel coding separation, have been demonstrated through the training of an autoencoder, with a non-trainable channel layer in the middle. However, these methods assume that any complex value can be transmitted over the channel, which can prevent the application of the algorithm in scenarios where the hardware or protocol can only admit certain sets of channel inputs, such as the use of a digital constellation. Herein, we propose DeepJSCC-Q, an end-to-end optimized joint source-channel coding scheme for wireless image transmission, which is able to operate with a fixed channel input alphabet. We show that DeepJSCC-Q can achieve similar performance to models that use continuous-valued channel input. Importantly, it preserves the graceful degradation of image quality observed in prior work when channel conditions worsen, making DeepJSCC-Q much more attractive for deployment in practical systems.

Submitted 25 November, 2021; originally announced November 2021.

arXiv:2111.13034

DeepWiVe: Deep-Learning-Aided Wireless Video Transmission

Authors: Tze-Yang Tung, Deniz Gündüz

Abstract: We present DeepWiVe, the first-ever end-to-end joint source-channel coding (JSCC) video transmission scheme that leverages the power of deep neural networks (DNNs) to directly map video signals to channel symbols, combining video compression, channel coding, and modulation steps into a single neural transform. Our DNN decoder predicts residuals without distortion feedback, which improves video quality by accounting for occlusion/disocclusion and camera movements. We simultaneously train different bandwidth allocation networks for the frames to allow variable bandwidth transmission. Then, we train a bandwidth allocation network using reinforcement learning (RL) that optimizes the allocation of limited available channel bandwidth among video frames to maximize overall visual quality. Our results show that DeepWiVe can overcome the cliff-effect, which is prevalent in conventional separation-based digital communication schemes, and achieve graceful degradation with the mismatch between the estimated and actual channel qualities. DeepWiVe outperforms H.264 video compression followed by low-density parity check (LDPC) codes in all channel conditions by up to 0.0462 on average in terms of the multi-scale structural similarity index measure (MS-SSIM), while beating H.265 + LDPC by up to 0.0058 on average. We also illustrate the importance of optimizing bandwidth allocation in JSCC video transmission by showing that our optimal bandwidth allocation policy is superior to the naïve uniform allocation.
We believe this is an important step towards fulfilling the potential of an end-to-end optimized JSCC wireless video transmission system that is superior to the current separation-based designs.

Submitted 25 November, 2021; originally announced November 2021.

arXiv:2108.08807

Neural-GIF: Neural Generalized Implicit Functions for Animating People in Clothing

Authors: Garvita Tiwari, Nikolaos Sarafianos, Tony Tung, Gerard Pons-Moll

Abstract: We present Neural Generalized Implicit Functions (Neural-GIF), to animate people in clothing as a function of the body pose. Given a sequence of scans of a subject in various poses, we learn to animate the character for new poses. Existing methods have relied on template-based representations of the human body (or clothing). However, such models usually have fixed and limited resolutions, require difficult data pre-processing steps and cannot be used with complex clothing. We draw inspiration from template-based methods, which factorize motion into articulation and non-rigid deformation, but generalize this concept for implicit shape learning to obtain a more flexible model. We learn to map every point in the space to a canonical space, where a learned deformation field is applied to model non-rigid effects, before evaluating the signed distance field. Our formulation allows the learning of complex and non-rigid deformations of clothing and soft tissue, without computing a template registration as is common with current approaches. Neural-GIF can be trained on raw 3D scans and reconstructs detailed complex surface geometry and deformations. Moreover, the model can generalize to new poses. We evaluate our method on a variety of characters from different public datasets in diverse clothing styles and show significant improvements over baseline methods, quantitatively and qualitatively. We also extend our model to the multiple-shape setting.
To stimulate further research, we will make the model, code and data publicly available at: https://virtualhumans.mpi-inf.mpg.de/neuralgif/

Submitted 20 August, 2021; v1 submitted 19 August, 2021; originally announced August 2021.

arXiv:2108.07845

ARCH++: Animation-Ready Clothed Human Reconstruction Revisited

Authors: Tong He, Yuanlu Xu, Shunsuke Saito, Stefano Soatto, Tony Tung

Abstract: We present ARCH++, an image-based method to reconstruct 3D avatars with arbitrary clothing styles. Our reconstructed avatars are animation-ready and highly realistic, in both the visible regions from input views and the unseen regions. While prior work shows great promise of reconstructing animatable clothed humans with various topologies, we observe that there exist fundamental limitations resulting in sub-optimal reconstruction quality. In this paper, we revisit the major steps of image-based avatar reconstruction and address the limitations with ARCH++. First, we introduce an end-to-end point-based geometry encoder to better describe the semantics of the underlying 3D human body, replacing previous hand-crafted features. Second, in order to address the occupancy ambiguity caused by topological changes of clothed humans in the canonical pose, we propose a co-supervising framework with cross-space consistency to jointly estimate the occupancy in both the posed and canonical spaces. Last, we use image-to-image translation networks to further refine detailed geometry and texture on the reconstructed surface, which improves the fidelity and consistency across arbitrary viewpoints. In the experiments, we demonstrate improvements over the state of the art on both public benchmarks and user studies in reconstruction quality and realism.

Submitted 25 February, 2022; v1 submitted 17 August, 2021; originally announced August 2021.

Comments: published at ICCV 2021, project page: https://tonghehehe.com/archpp

arXiv:2104.08013

Data-Driven 3D Reconstruction of Dressed Humans From Sparse Views

Authors: Pierre Zins, Yuanlu Xu, Edmond Boyer, Stefanie Wuhrer, Tony Tung

Abstract: Recently, data-driven single-view reconstruction methods have shown great progress in modeling 3D dressed humans. However, such methods suffer heavily from depth ambiguities and occlusions inherent to single-view inputs. In this paper, we tackle this problem by considering a small set of input views and investigate the best strategy to suitably exploit information from these views. We propose a data-driven end-to-end approach that reconstructs an implicit 3D representation of dressed humans from sparse camera views. Specifically, we introduce three key components: first, a spatially consistent reconstruction that allows for arbitrary placement of the person in the input views using a perspective camera model; second, an attention-based fusion layer that learns to aggregate visual information from several viewpoints; and third, a mechanism that encodes local 3D patterns under the multi-view context. In the experiments, we show the proposed approach outperforms the state of the art on standard data both quantitatively and qualitatively. To demonstrate the spatially consistent reconstruction, we apply our approach to dynamic scenes. Additionally, we apply our method on real data acquired with a multi-camera platform and demonstrate our approach can obtain results comparable to multi-view stereo with dramatically fewer views.

Submitted 5 December, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

Comments: Presented at 3DV 2021. Code is released at https://gitlab.inria.fr/pzins/data-driven-3d-reconstruction-of-dressed-humans-from-sparse-views/

Journal ref: 3DV 2021

arXiv:2103.17266

Semi-supervised Synthesis of High-Resolution Editable Textures for 3D Humans

Authors: Bindita Chaudhuri, Nikolaos Sarafianos, Linda Shapiro, Tony Tung

Abstract: We introduce a novel approach to generate diverse high-fidelity texture maps for 3D human meshes in a semi-supervised setup. Given a segmentation mask defining the layout of the semantic regions in the texture map, our network generates high-resolution textures with a variety of styles that are then used for rendering purposes. To accomplish this task, we propose a Region-adaptive Adversarial Variational AutoEncoder (ReAVAE) that learns the probability distribution of the style of each region individually so that the style of the generated texture can be controlled by sampling from the region-specific distributions. In addition, we introduce a data generation technique to augment our training set with data lifted from single-view RGB inputs. Our training strategy allows the mixing of reference image styles with arbitrary styles for different regions, a property which can be valuable for virtual try-on AR/VR applications. Experimental results show that our method synthesizes better texture maps compared to prior work while enabling independent layout and style controllability.

Submitted 31 March, 2021; originally announced March 2021.

Comments: CVPR 2021

arXiv:2102.02802

Federated mmWave Beam Selection Utilizing LIDAR Data

Authors: Mahdi Boloursaz Mashhadi, Mikolaj Jankowski, Tze-Yang Tung, Szymon Kobus, Deniz Gunduz

Abstract: Efficient link configuration in millimeter wave (mmWave) communication systems is a crucial yet challenging task due to the overhead imposed by beam selection. For vehicle-to-infrastructure (V2I) networks, side information from LIDAR sensors mounted on the vehicles has been leveraged to reduce the beam search overhead. In this letter, we propose a federated LIDAR-aided beam selection method for V2I mmWave communication systems. In the proposed scheme, connected vehicles collaborate to train a shared neural network (NN) on their locally available LIDAR data during normal operation of the system. We also propose a reduced-complexity convolutional NN (CNN) classifier architecture and LIDAR preprocessing, which significantly outperform previous works in terms of both performance and complexity.

Submitted 25 July, 2021; v1 submitted 4 February, 2021; originally announced February 2021.

arXiv:2101.10369

Effective Communications: A Joint Learning and Communication Framework for Multi-Agent Reinforcement Learning over Noisy Channels

Authors: Tze-Yang Tung, Szymon Kobus, Joan Roig Pujol, Deniz Gunduz

Abstract: We propose a novel formulation of the "effectiveness problem" in communications, put forth by Shannon and Weaver in their seminal work [2], by considering multiple agents communicating over a noisy channel in order to achieve better coordination and cooperation in a multi-agent reinforcement learning (MARL) framework. Specifically, we consider a multi-agent partially observable Markov decision process (MA-POMDP), in which the agents, in addition to interacting with the environment, can also communicate with each other over a noisy communication channel. The noisy communication channel is considered explicitly as part of the dynamics of the environment and the message each agent sends is part of the action that the agent can take. As a result, the agents learn not only to collaborate with each other but also to communicate "effectively" over a noisy channel. This framework generalizes both the traditional communication problem, where the main goal is to convey a message reliably over a noisy channel, and the "learning to communicate" framework that has received recent attention in the MARL literature, where the underlying communication channels are assumed to be error-free. We show via examples that the joint policy learned using the proposed framework is superior to that where the communication is considered separately from the underlying MA-POMDP.
This is a very powerful framework, which has many real-world applications, from autonomous vehicle planning to drone swarm control, and opens up the rich toolbox of deep reinforcement learning for the design of multi-user communication systems.

Submitted 1 April, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

arXiv:2011.00073

Resource-Aware Pareto-Optimal Automated Machine Learning Platform

Authors: Yao Yang, Andrew Nam, Mohamad M. Nasr-Azadani, Teresa Tung

Abstract: In this study, we introduce a novel platform, Resource-Aware AutoML (RA-AutoML), which enables flexible and generalized algorithms to build machine learning models subjected to multiple objectives, as well as resource and hardware constraints. RA-AutoML intelligently conducts Hyper-Parameter Search (HPS) as well as Neural Architecture Search (NAS) to build models optimizing predefined objectives. RA-AutoML is a versatile framework that allows users to prescribe many resource/hardware constraints along with objectives demanded by the problem at hand or business requirements. At its core, RA-AutoML relies on our in-house search-engine algorithm, MOBOGA, which combines a modified constraint-aware Bayesian Optimization and Genetic Algorithm to construct Pareto optimal candidates. Our experiments on the CIFAR-10 dataset show very good accuracy compared to results obtained by state-of-the-art neural network models, while subjected to resource constraints in the form of model size.

Submitted 30 October, 2020; originally announced November 2020.

Comments: Accepted for International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), IEEE. December 2020

ACM Class: F.m; I.2

arXiv:2008.00158

TexMesh: Reconstructing Detailed Human Texture and Geometry from RGB-D Video

Authors: Tiancheng Zhi, Christoph Lassner, Tony Tung, Carsten Stoll, Srinivasa G. Narasimhan, Minh Vo

Abstract: We present TexMesh, a novel approach to reconstruct detailed human meshes with high-resolution full-body texture from RGB-D video. TexMesh enables high-quality free-viewpoint rendering of humans. Given the RGB frames, the captured environment map, and the coarse per-frame human mesh from RGB-D tracking, our method reconstructs spatiotemporally consistent and detailed per-frame meshes along with a high-resolution albedo texture. By using the incident illumination we are able to accurately estimate local surface geometry and albedo, which allows us to further use photometric constraints to adapt a synthetically trained model to real-world sequences in a self-supervised manner for detailed surface geometry and high-resolution texture estimation. In practice, we train our models on a short example sequence for self-adaptation and the model runs at an interactive framerate afterwards. We validate TexMesh on synthetic and real-world data, and show it outperforms the state of the art quantitatively and qualitatively.

Submitted 20 September, 2020; v1 submitted 31 July, 2020; originally announced August 2020.

Comments: ECCV 2020

arXiv:2007.11610

SIZER: A Dataset and Model for Parsing 3D Clothing and Learning Size Sensitive 3D Clothing

Authors: Garvita Tiwari, Bharat Lal Bhatnagar, Tony Tung, Gerard Pons-Moll

Abstract: While models of 3D clothing learned from real data exist, no method can predict clothing deformation as a function of garment size. In this paper, we introduce SizerNet to predict 3D clothing conditioned on human body shape and garment size parameters, and ParserNet to infer garment meshes and shape under clothing with personal details in a single pass from an input mesh. SizerNet makes it possible to estimate and visualize the dressing effect of a garment in various sizes, and ParserNet makes it possible to edit the clothing of an input mesh directly, removing the need for scan segmentation, which is a challenging problem in itself. To learn these models, we introduce the SIZER dataset of clothing size variation, which includes 100 different subjects wearing casual clothing items in various sizes, totaling approximately 2000 scans. This dataset includes the scans, registrations to the SMPL model, scans segmented in clothing parts, garment category and size labels. Our experiments show better parsing accuracy and size prediction than baseline methods trained on SIZER. The code, model and dataset will be released for research purposes.

Submitted 22 July, 2020; originally announced July 2020.

Comments: European Conference on Computer Vision 2020

arXiv:2004.04572

ARCH: Animatable Reconstruction of Clothed Humans

Authors: Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, Tony Tung

Abstract: In this paper, we propose ARCH (Animatable Reconstruction of Clothed Humans), a novel end-to-end framework for accurate reconstruction of animation-ready 3D clothed humans from a monocular image. Existing approaches to digitize 3D humans struggle to handle pose variations and recover details. Also, they do not produce models that are animation-ready. In contrast, ARCH is a learned pose-aware model that produces detailed 3D rigged full-body human avatars from a single unconstrained RGB image. A Semantic Space and a Semantic Deformation Field are created using a parametric 3D body estimator. They allow the transformation of 2D/3D clothed humans into a canonical space, reducing ambiguities in geometry caused by pose variations and occlusions in training data. Detailed surface geometry and appearance are learned using an implicit function representation with spatial local features. Furthermore, we propose additional per-pixel supervision on the 3D reconstruction using opacity-aware differentiable rendering. Our experiments indicate that ARCH increases the fidelity of the reconstructed humans. We obtain more than 50% lower reconstruction errors for standard metrics compared to state-of-the-art methods on public datasets. We also show numerous qualitative examples of animated, high-quality reconstructed avatars unseen in the literature so far.

Submitted 10 April, 2020; v1 submitted 8 April, 2020; originally announced April 2020.

Comments: 10 pages, 10 figures, CVPR2020

arXiv:1910.00116

DenseRaC: Joint 3D Pose and Shape Estimation by Dense Render-and-Compare

Authors: Yuanlu Xu, Song-Chun Zhu, Tony Tung

Abstract: We present DenseRaC, a novel end-to-end framework for jointly estimating 3D human pose and body shape from a monocular RGB image. Our two-step framework takes the body pixel-to-surface correspondence map (i.e., IUV map) as proxy representation and then performs estimation of parameterized human pose and shape. Specifically, given an estimated IUV map, we develop a deep neural network optimizing 3D body reconstruction losses and further integrating a render-and-compare scheme to minimize differences between the input and the rendered output, i.e., dense body landmarks, body part masks, and adversarial priors. To boost learning, we further construct a large-scale synthetic dataset (MOCA) utilizing web-crawled Mocap sequences, 3D scans and animations. The generated data covers diversified camera views, human actions and body shapes, and is paired with full ground truth. Our model jointly learns to represent the 3D human body from hybrid datasets, mitigating the problem of unpaired training data. Our experiments show that DenseRaC obtains superior performance against the state of the art on public benchmarks of various human-related tasks.

Submitted 9 October, 2019; v1 submitted 30 September, 2019; originally announced October 2019.

Comments: 11 pages, 8 figures, International Conference on Computer Vision (ICCV) 2019, Oral Presentation

arXiv:1811.10079

doi 10.1109/LCOMM.2018.2877316

SparseCast: Hybrid Digital-Analog Wireless Image Transmission Exploiting Frequency Domain Sparsity

Authors: Tze-Yang Tung, Deniz Gündüz

Abstract: A hybrid digital-analog wireless image transmission scheme, called SparseCast, is introduced, which provides graceful degradation with channel quality. SparseCast achieves improved end-to-end reconstruction quality while reducing the bandwidth requirement by exploiting frequency domain sparsity through compressed sensing. The proposed algorithm produces a linear relationship between the channel signal-to-noise ratio (CSNR) and peak signal-to-noise ratio (PSNR), without requiring the channel state knowledge at the transmitter. This is particularly attractive when transmitting to multiple receivers or over unknown time-varying channels, as the receiver PSNR depends on the experienced channel quality, and is not bottlenecked by the worst channel. SparseCast is benchmarked against two alternative algorithms: SoftCast and BCS-SPL. Our findings show that the proposed algorithm outperforms SoftCast by approximately 3.5 dB and BCS-SPL by 15.2 dB.

Submitted 25 November, 2018; originally announced November 2018.

Comments: This paper is accepted to appear in IEEE Communications Letters

arXiv:1808.03417

DeepWrinkles: Accurate and Realistic Clothing Modeling

Authors: Zorah Laehner, Daniel Cremers, Tony Tung

Abstract: We present a novel method to generate accurate and realistic clothing deformation from real data capture. Previous methods for realistic cloth modeling mainly rely on intensive computation of physics-based simulation (with numerous heuristic parameters), while models reconstructed from visual observations typically suffer from a lack of geometric details. Here, we propose an original framework consisting of two modules that work jointly to represent global shape deformation as well as surface details with high fidelity. Global shape deformations are recovered from a subspace model learned from 3D data of clothed people in motion, while high-frequency details are added to normal maps created using a conditional Generative Adversarial Network whose architecture is designed to enforce realism and temporal consistency. This leads to unprecedented high-quality rendering of clothing deformation sequences, where fine wrinkles from (real) high-resolution observations can be recovered. In addition, as the model is learned independently from body shape and pose, the framework is suitable for applications that require retargeting (e.g., body animation). Our experiments show original high-quality results with a flexible model. We claim an entirely data-driven approach to realistic cloth wrinkle generation is possible.

Submitted 10 August, 2018; originally announced August 2018.

Comments: 18 pages, 12 figures, 15th European Conference on Computer Vision (ECCV) 2018, Oral Presentation

arXiv:1609.09270

Pano2CAD: Room Layout From A Single Panorama Image

Authors: Jiu Xu, Bjorn Stenger, Tommi Kerola, Tony Tung

Abstract: This paper presents a method of estimating the geometry of a room and the 3D pose of objects from a single 360-degree panorama image. Assuming Manhattan World geometry, we formulate the task as a Bayesian inference problem in which we estimate positions and orientations of walls and objects. The method combines surface normal estimation, 2D object detection and 3D object pose estimation. Quantitative results are presented on a dataset of synthetically generated 3D rooms containing objects, as well as on a subset of hand-labeled images from the public SUN360 dataset.

Submitted 30 September, 2016; v1 submitted 29 September, 2016; originally announced September 2016.

Showing 1–33 of 33 results for author: Tung, T