Search | arXiv e-print repository

Learning multiplane images from single views with self-supervision

Authors: Gustavo Sutter P. Carvalho, Diogo C. Luvizon, Antonio Joia Neto, Andre G. C. Pacheco, Otavio A. B. Penatti

Abstract: Generating static novel views from an already captured image is a hard task in computer vision and graphics, in particular when the single input image has dynamic parts such as persons or moving objects. In this paper, we tackle this problem by proposing a new framework, called CycleMPI, that is capable of learning a multiplane image representation from single images through a cyclic training stra… ▽ More Generating static novel views from an already captured image is a hard task in computer vision and graphics, in particular when the single input image has dynamic parts such as persons or moving objects. In this paper, we tackle this problem by proposing a new framework, called CycleMPI, that is capable of learning a multiplane image representation from single images through a cyclic training strategy for self-supervision. Our framework does not require stereo data for training, therefore it can be trained with massive visual data from the Internet, resulting in a better generalization capability even for very challenging cases. Although our method does not require stereo data for supervision, it reaches results on stereo datasets comparable to the state of the art in a zero-shot scenario. We evaluated our method on RealEstate10K and Mannequin Challenge datasets for view synthesis and presented qualitative results on Places II dataset. △ Less

Submitted 19 October, 2021; v1 submitted 18 October, 2021; originally announced October 2021.

Comments: To appear on BMVC 2021

arXiv:2102.11771 [pdf, ps, other]

Improving Deep Learning Sound Events Classifiers using Gram Matrix Feature-wise Correlations

Authors: Antonio Joia Neto, Andre G C Pacheco, Diogo C Luvizon

Abstract: In this paper, we propose a new Sound Event Classification (SEC) method which is inspired in recent works for out-of-distribution detection. In our method, we analyse all the activations of a generic CNN in order to produce feature representations using Gram Matrices. The similarity metrics are evaluated considering all possible classes, and the final prediction is defined as the class that minimi… ▽ More In this paper, we propose a new Sound Event Classification (SEC) method which is inspired in recent works for out-of-distribution detection. In our method, we analyse all the activations of a generic CNN in order to produce feature representations using Gram Matrices. The similarity metrics are evaluated considering all possible classes, and the final prediction is defined as the class that minimizes the deviation with respect to the features seeing during training. The proposed approach can be applied to any CNN and our experimental evaluation of four different architectures on two datasets demonstrated that our method consistently improves the baseline models. △ Less

Submitted 23 February, 2021; originally announced February 2021.

Comments: To appear on ICASSP 2021

arXiv:2011.13317 [pdf, other]

Adaptive Multiplane Image Generation from a Single Internet Picture

Authors: Diogo C. Luvizon, Gustavo Sutter P. Carvalho, Andreza A. dos Santos, Jhonatas S. Conceicao, Jose L. Flores-Campana, Luis G. L. Decker, Marcos R. Souza, Helio Pedrini, Antonio Joia, Otavio A. B. Penatti

Abstract: In the last few years, several works have tackled the problem of novel view synthesis from stereo images or even from a single picture. However, previous methods are computationally expensive, specially for high-resolution images. In this paper, we address the problem of generating a multiplane image (MPI) from a single high-resolution picture. We present the adaptive-MPI representation, which all… ▽ More In the last few years, several works have tackled the problem of novel view synthesis from stereo images or even from a single picture. However, previous methods are computationally expensive, specially for high-resolution images. In this paper, we address the problem of generating a multiplane image (MPI) from a single high-resolution picture. We present the adaptive-MPI representation, which allows rendering novel views with low computational requirements. To this end, we propose an adaptive slicing algorithm that produces an MPI with a variable number of image planes. We present a new lightweight CNN for depth estimation, which is learned by knowledge distillation from a larger network. Occluded regions in the adaptive-MPI are inpainted also by a lightweight CNN. We show that our method is capable of producing high-quality predictions with one order of magnitude less parameters compared to previous approaches. The robustness of our method is evidenced on challenging pictures from the Internet. △ Less

Submitted 26 November, 2020; originally announced November 2020.

arXiv:2010.02680 [pdf, other]

doi 10.1109/ICIP40778.2020.9191168

Parallax Motion Effect Generation Through Instance Segmentation And Depth Estimation

Authors: Allan Pinto, Manuel A. Córdova, Luis G. L. Decker, Jose L. Flores-Campana, Marcos R. Souza, Andreza A. dos Santos, Jhonatas S. Conceição, Henrique F. Gagliardi, Diogo C. Luvizon, Ricardo da S. Torres, Helio Pedrini

Abstract: Stereo vision is a growing topic in computer vision due to the innumerable opportunities and applications this technology offers for the development of modern solutions, such as virtual and augmented reality applications. To enhance the user's experience in three-dimensional virtual environments, the motion parallax estimation is a promising technique to achieve this objective. In this paper, we p… ▽ More Stereo vision is a growing topic in computer vision due to the innumerable opportunities and applications this technology offers for the development of modern solutions, such as virtual and augmented reality applications. To enhance the user's experience in three-dimensional virtual environments, the motion parallax estimation is a promising technique to achieve this objective. In this paper, we propose an algorithm for generating parallax motion effects from a single image, taking advantage of state-of-the-art instance segmentation and depth estimation approaches. This work also presents a comparison against such algorithms to investigate the trade-off between efficiency and quality of the parallax motion effects, taking into consideration a multi-task learning network capable of estimating instance segmentation and depth estimation at once. Experimental results and visual quality assessment indicate that the PyD-Net network (depth estimation) combined with Mask R-CNN or FBNet networks (instance segmentation) can produce parallax motion effects with good visual quality. △ Less

Submitted 6 October, 2020; originally announced October 2020.

Comments: 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates

Journal ref: 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 2020, pp. 1621-1625

arXiv:1912.08077 [pdf, other]

doi 10.1109/TPAMI.2020.2976014

Multi-task Deep Learning for Real-Time 3D Human Pose Estimation and Action Recognition

Authors: Diogo C Luvizon, Hedi Tabia, David Picard

Abstract: Human pose estimation and action recognition are related tasks since both problems are strongly dependent on the human body representation and analysis. Nonetheless, most recent methods in the literature handle the two problems separately. In this work, we propose a multi-task framework for jointly estimating 2D or 3D human poses from monocular color images and classifying human actions from video… ▽ More Human pose estimation and action recognition are related tasks since both problems are strongly dependent on the human body representation and analysis. Nonetheless, most recent methods in the literature handle the two problems separately. In this work, we propose a multi-task framework for jointly estimating 2D or 3D human poses from monocular color images and classifying human actions from video sequences. We show that a single architecture can be used to solve both problems in an efficient way and still achieves state-of-the-art or comparable results at each task while running at more than 100 frames per second. The proposed method benefits from high parameters sharing between the two tasks by unifying still images and video clips processing in a single pipeline, allowing the model to be trained with data from different categories simultaneously and in a seamlessly way. Additionally, we provide important insights for end-to-end training the proposed multi-task model by decoupling key prediction parts, which consistently leads to better accuracy on both tasks. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU RGB+D) demonstrate the effectiveness of our method on the targeted tasks. Our source code and trained weights are publicly available at https://github.com/dluvizon/deephar. △ Less

Submitted 3 March, 2020; v1 submitted 14 December, 2019; originally announced December 2019.

Comments: Accepted to TPAMI. arXiv admin note: text overlap with arXiv:1802.09232

arXiv:1911.09245 [pdf, other]

Consensus-based Optimization for 3D Human Pose Estimation in Camera Coordinates

Authors: Diogo C Luvizon, Hedi Tabia, David Picard

Abstract: 3D human pose estimation is frequently seen as the task of estimating 3D poses relative to the root body joint. Alternatively, we propose a 3D human pose estimation method in camera coordinates, which allows effective combination of 2D annotated data and 3D poses and a straightforward multi-view generalization. To that end, we cast the problem as a view frustum space pose estimation, where absolut… ▽ More 3D human pose estimation is frequently seen as the task of estimating 3D poses relative to the root body joint. Alternatively, we propose a 3D human pose estimation method in camera coordinates, which allows effective combination of 2D annotated data and 3D poses and a straightforward multi-view generalization. To that end, we cast the problem as a view frustum space pose estimation, where absolute depth prediction and joint relative depth estimations are disentangled. Final 3D predictions are obtained in camera coordinates by the inverse camera projection. Based on this, we also present a consensus-based optimization algorithm for multi-view predictions from uncalibrated images, which requires a single monocular training procedure. Although our method is indirectly tied to the training camera intrinsics, it still converges for cameras with different intrinsic parameters, resulting in coherent estimations up to a scale factor. Our method improves the state of the art on well known 3D human pose datasets, reducing the prediction error by 32% in the most common benchmark. We also reported our results in absolute pose position error, achieving 80~mm for monocular estimations and 51~mm for multi-view, on average. △ Less

Submitted 20 August, 2021; v1 submitted 20 November, 2019; originally announced November 2019.

Comments: Source code is available at https://github.com/dluvizon/3d-pose-consensus

arXiv:1802.09232 [pdf, other]

2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning

Authors: Diogo C. Luvizon, David Picard, Hedi Tabia

Abstract: Action recognition and human pose estimation are closely related but both problems are generally handled as distinct tasks in the literature. In this work, we propose a multitask framework for jointly 2D and 3D pose estimation from still images and human action recognition from video sequences. We show that a single architecture can be used to solve the two problems in an efficient way and still a… ▽ More Action recognition and human pose estimation are closely related but both problems are generally handled as distinct tasks in the literature. In this work, we propose a multitask framework for jointly 2D and 3D pose estimation from still images and human action recognition from video sequences. We show that a single architecture can be used to solve the two problems in an efficient way and still achieves state-of-the-art results. Additionally, we demonstrate that optimization from end-to-end leads to significantly higher accuracy than separated learning. The proposed architecture can be trained with data from different categories simultaneously in a seamlessly way. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU) demonstrate the effectiveness of our method on the targeted tasks. △ Less

Submitted 21 March, 2018; v1 submitted 26 February, 2018; originally announced February 2018.

Comments: To appear in CVPR 2018

arXiv:1710.02322 [pdf, other]

Human Pose Regression by Combining Indirect Part Detection and Contextual Information

Authors: Diogo C. Luvizon, Hedi Tabia, David Picard

Abstract: In this paper, we propose an end-to-end trainable regression approach for human pose estimation from still images. We use the proposed Soft-argmax function to convert feature maps directly to joint coordinates, resulting in a fully differentiable framework. Our method is able to learn heat maps representations indirectly, without additional steps of artificial ground truth generation. Consequently… ▽ More In this paper, we propose an end-to-end trainable regression approach for human pose estimation from still images. We use the proposed Soft-argmax function to convert feature maps directly to joint coordinates, resulting in a fully differentiable framework. Our method is able to learn heat maps representations indirectly, without additional steps of artificial ground truth generation. Consequently, contextual information can be included to the pose predictions in a seamless way. We evaluated our method on two very challenging datasets, the Leeds Sports Poses (LSP) and the MPII Human Pose datasets, reaching the best performance among all the existing regression methods and comparable results to the state-of-the-art detection based approaches. △ Less

Submitted 6 October, 2017; originally announced October 2017.

Showing 1–8 of 8 results for author: Luvizon, D C