Search | arXiv e-print repository

Improving the Robustness of 3D Human Pose Estimation: A Benchmark and Learning from Noisy Input

Authors: Trung-Hieu Hoang, Mona Zehni, Huy Phan, Duc Minh Vo, Minh N. Do

Abstract: Despite the promising performance of current 3D human pose estimation techniques, understanding and enhancing their generalization on challenging in-the-wild videos remain an open problem. In this work, we focus on the robustness of 2D-to-3D pose lifters. To this end, we develop two benchmark datasets, namely Human3.6M-C and HumanEva-I-C, to examine the robustness of video-based 3D pose lifters to a wide range of common video corruptions including temporary occlusion, motion blur, and pixel-level noise. We observe the poor generalization of state-of-the-art 3D pose lifters in the presence of corruption and establish two techniques to tackle this issue. First, we introduce Temporal Additive Gaussian Noise (TAGN) as a simple yet effective 2D input pose data augmentation. Additionally, to incorporate the confidence scores output by the 2D pose detectors, we design a confidence-aware convolution (CA-Conv) block. Extensively tested on corrupted videos, the proposed strategies consistently boost the robustness of 3D pose lifters and serve as new baselines for future research.

Submitted 15 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

arXiv:2311.18193 [pdf, other]

Persistent Test-time Adaptation in Episodic Testing Scenarios

Authors: Trung-Hieu Hoang, Duc Minh Vo, Minh N. Do

Abstract: Current test-time adaptation (TTA) approaches aim to adapt to environments that change continuously. Yet, when the environments not only change but also recur in a correlated manner over time, such as in the case of day-night surveillance cameras, it is unclear whether the adaptability of these methods is sustained after a long run. This study aims to examine the error accumulation of TTA models when they are repeatedly exposed to previous testing environments, proposing a novel testing setting called episodic TTA. To study this phenomenon, we design a simulation of TTA process on a simple yet representative $ε$-perturbed Gaussian Mixture Model Classifier and derive the theoretical findings revealing the dataset- and algorithm-dependent factors that contribute to the gradual degeneration of TTA methods through time. Our investigation has led us to propose a method, named persistent TTA (PeTTA). PeTTA senses the model divergence towards a collapsing and adjusts the adaptation strategy of TTA, striking a balance between two primary objectives: adaptation and preventing model collapse. The stability of PeTTA in the face of episodic TTA scenarios has been demonstrated through a set of comprehensive experiments on various benchmarks.

Submitted 16 January, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.15879 [pdf, other]

EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension

Authors: Jiaxuan Li, Duc Minh Vo, Akihiro Sugimoto, Hideki Nakayama

Abstract: Large language models (LLMs)-based image captioning has the capability of describing objects not explicitly observed in training data; yet novel objects occur frequently, necessitating the requirement of sustaining up-to-date object knowledge for open-world comprehension. Instead of relying on large amounts of data and/or scaling up network parameters, we introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from External Visual-name memory (EVCap). We build ever-changing object knowledge memory using objects' visuals and names, enabling us to (i) update the memory at a minimal cost and (ii) effortlessly augment LLMs with retrieved object names by utilizing a lightweight and fast-to-train model. Our model, which was trained only on the COCO dataset, can adapt to out-of-domain without requiring additional fine-tuning or re-training. Our experiments conducted on benchmarks and synthetic commonsense-violating data show that EVCap, with only 3.97M trainable parameters, exhibits superior performance compared to other methods based on frozen pre-trained LLMs. Its performance is also competitive to specialist SOTAs that require extensive training.

Submitted 7 April, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

Comments: CVPR 2024

arXiv:2311.12897 [pdf, other]

A Compact Dynamic 3D Gaussian Representation for Real-Time Dynamic View Synthesis

Authors: Kai Katsumata, Duc Minh Vo, Hideki Nakayama

Abstract: 3D Gaussian Splatting (3DGS) has shown remarkable success in synthesizing novel views given multiple views of a static scene. Yet, 3DGS faces challenges when applied to dynamic scenes because 3D Gaussian parameters need to be updated per timestep, requiring a large amount of memory and at least a dozen observations per timestep. To address these limitations, we present a compact dynamic 3D Gaussian representation that models positions and rotations as functions of time with a few parameter approximations while keeping other properties of 3DGS including scale, color and opacity invariant. Our method can dramatically reduce memory usage and relax a strict multi-view assumption. In our experiments on monocular and multi-view scenarios, we show that our method not only matches state-of-the-art methods, often linked with slower rendering speeds, in terms of high rendering quality but also significantly surpasses them by achieving a rendering speed of $118$ frames per second (FPS) at a resolution of 1,352$\times$1,014 on a single GPU.

Submitted 4 July, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

Comments: 17 pages, 11 figures, ECCV 2024

arXiv:2310.14602 [pdf, ps, other]

Generative Pre-trained Transformer for Vietnamese Community-based COVID-19 Question Answering

Authors: Tam Minh Vo, Khiem Vinh Tran

Abstract: Recent studies have provided empirical evidence of the wide-ranging potential of Generative Pre-trained Transformer (GPT), a pretrained language model, in the field of natural language processing. GPT has been effectively employed as a decoder within state-of-the-art (SOTA) question answering systems, yielding exceptional performance across various tasks. However, the current research landscape concerning GPT's application in Vietnamese remains limited. This paper aims to address this gap by presenting an implementation of GPT-2 for community-based question answering specifically focused on COVID-19 related queries in Vietnamese. We introduce a novel approach by conducting a comparative analysis of different Transformers vs SOTA models in the community-based COVID-19 question answering dataset. The experimental findings demonstrate that the GPT-2 models exhibit highly promising outcomes, outperforming other SOTA models as well as previous community-based COVID-19 question answering models developed for Vietnamese.

Submitted 31 October, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

arXiv:2308.10005 [pdf, other]

Partition-and-Debias: Agnostic Biases Mitigation via A Mixture of Biases-Specific Experts

Authors: Jiaxuan Li, Duc Minh Vo, Hideki Nakayama

Abstract: Bias mitigation in image classification has been widely researched, and existing methods have yielded notable results. However, most of these methods implicitly assume that a given image contains only one type of known or unknown bias, failing to consider the complexities of real-world biases. We introduce a more challenging scenario, agnostic biases mitigation, aiming at bias removal regardless of whether the type of bias or the number of types is unknown in the datasets. To address this difficult task, we present the Partition-and-Debias (PnD) method that uses a mixture of biases-specific experts to implicitly divide the bias space into multiple subspaces and a gating module to find a consensus among experts to achieve debiased classification. Experiments on both public and constructed benchmarks demonstrated the efficacy of the PnD. Code is available at: https://github.com/Jiaxuan-Li/PnD.

Submitted 19 August, 2023; originally announced August 2023.

Comments: ICCV 2023

arXiv:2307.08995 [pdf, other]

Revisiting Latent Space of GAN Inversion for Real Image Editing

Authors: Kai Katsumata, Duc Minh Vo, Bei Liu, Hideki Nakayama

Abstract: The exploration of the latent space in StyleGANs and GAN inversion exemplify impressive real-world image editing, yet the trade-off between reconstruction quality and editing quality remains an open problem. In this study, we revisit StyleGANs' hyperspherical prior $\mathcal{Z}$ and combine it with highly capable latent spaces to build combined spaces that faithfully invert real images while maintaining the quality of edited images. More specifically, we propose $\mathcal{F}/\mathcal{Z}^{+}$ space consisting of two subspaces: $\mathcal{F}$ space of an intermediate feature map of StyleGANs enabling faithful reconstruction and $\mathcal{Z}^{+}$ space of an extended StyleGAN prior supporting high editing quality. We project the real images into the proposed space to obtain the inverted codes, by which we then move along $\mathcal{Z}^{+}$, enabling semantic editing without sacrificing image quality. Comprehensive experiments show that $\mathcal{Z}^{+}$ can replace the most commonly-used $\mathcal{W}$, $\mathcal{W}^{+}$, and $\mathcal{S}$ spaces while preserving reconstruction quality, resulting in reduced distortion of edited images.

Submitted 18 July, 2023; originally announced July 2023.

Comments: 10 pages, 12 figures. arXiv admin note: substantial text overlap with arXiv:2306.00241

arXiv:2307.08319 [pdf, other]

Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and Uncurated Unlabeled Data

Authors: Kai Katsumata, Duc Minh Vo, Tatsuya Harada, Hideki Nakayama

Abstract: Label-noise or curated unlabeled data is used to compensate for the assumption of clean labeled data in training the conditional generative adversarial network; however, satisfying such an extended assumption is occasionally laborious or impractical. As a step towards generative modeling accessible to everyone, we introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated unlabeled data during training: (i) closed-set and open-set label noise in labeled data and (ii) closed-set and open-set unlabeled data. To combat it, we propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels for unlabeled data and correcting wrong labels for labeled data. Unlike popular curriculum learning, which uses a threshold to pick the training samples, our soft curriculum controls the effect of each training instance by using the weights predicted by the auxiliary classifier, resulting in the preservation of useful samples while ignoring harmful ones. Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance. In particular, the proposed approach is able to match the performance of (semi-) supervised GANs even with less than half the labeled data.

Submitted 17 July, 2023; originally announced July 2023.

Comments: 10 pages, 13 figures
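
The contrast drawn in the abstract above, between picking samples with a threshold and weighting every sample, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed 0.5 threshold, and the normalisation by the weight sum are assumptions made for illustration, and the weights here are plain numbers rather than outputs of an auxiliary classifier.

```python
def hard_threshold_loss(losses, confidences, threshold=0.5):
    """Threshold-style selection: average only the samples whose confidence passes the cut."""
    kept = [loss for loss, conf in zip(losses, confidences) if conf >= threshold]
    # If nothing passes the threshold there is nothing to train on.
    return sum(kept) / len(kept) if kept else 0.0


def soft_weighted_loss(losses, weights):
    """Soft weighting: every sample contributes in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    # Normalising by the weight sum is an illustrative choice, not taken from the paper.
    return sum(w * loss for w, loss in zip(weights, losses)) / total_weight


if __name__ == "__main__":
    losses = [1.0, 2.0, 4.0]          # hypothetical per-sample losses
    confidences = [0.9, 0.4, 0.1]     # hypothetical per-sample weights
    print(hard_threshold_loss(losses, confidences))
    print(soft_weighted_loss(losses, confidences))
```

With the hard threshold, the second and third samples are discarded entirely; with soft weighting they still contribute, only less.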

arXiv:2307.05058 [pdf, ps, other]

Multi-parameter Szemerédi-Trotter-type theorems and applications in finite fields

Authors: Hung Le, Steven Senger, Minh-Quan Vo

Abstract: We prove some novel multi-parameter point-line incidence estimates in vector spaces over finite fields. While these could be seen as special cases of higher-dimensional incidence results, they outperform their more general counterparts in those contexts. We go on to present a number of applications to illustrate their use in combinatorial problems from geometry and number theory.

Submitted 7 August, 2023; v1 submitted 11 July, 2023; originally announced July 2023.

MSC Class: 51A20

arXiv:2306.00241 [pdf, other]

Balancing Reconstruction and Editing Quality of GAN Inversion for Real Image Editing with StyleGAN Prior Latent Space

Authors: Kai Katsumata, Duc Minh Vo, Bei Liu, Hideki Nakayama

Abstract: The exploration of the latent space in StyleGANs and GAN inversion exemplify impressive real-world image editing, yet the trade-off between reconstruction quality and editing quality remains an open problem. In this study, we revisit StyleGANs' hyperspherical prior $\mathcal{Z}$ and $\mathcal{Z}^+$ and integrate them into seminal GAN inversion methods to improve editing quality. Besides faithful reconstruction, our extensions achieve sophisticated editing quality with the aid of the StyleGAN prior. We project the real images into the proposed space to obtain the inverted codes, by which we then move along $\mathcal{Z}^{+}$, enabling semantic editing without sacrificing image quality. Comprehensive experiments show that $\mathcal{Z}^{+}$ can replace the most commonly-used $\mathcal{W}$, $\mathcal{W}^{+}$, and $\mathcal{S}$ spaces while preserving reconstruction quality, resulting in reduced distortion of edited images.

Submitted 31 May, 2023; originally announced June 2023.

Comments: 5 pages, 9 figures, AI4CC Workshop at CVPR 2023

arXiv:2305.16487 [pdf, other]

EgoHumans: An Egocentric 3D Multi-Human Benchmark

Authors: Rawal Khirodkar, Aayush Bansal, Lingni Ma, Richard Newcombe, Minh Vo, Kris Kitani

Abstract: We present EgoHumans, a new multi-view multi-human video benchmark to advance the state-of-the-art of egocentric human 3D pose estimation and tracking. Existing egocentric benchmarks either capture single subject or indoor-only scenarios, which limit the generalization of computer vision algorithms for real-world applications. We propose a novel 3D capture setup to construct a comprehensive egocentric multi-human benchmark in the wild with annotations to support diverse tasks such as human detection, tracking, 2D/3D pose estimation, and mesh recovery. We leverage consumer-grade wearable camera-equipped glasses for the egocentric view, which enables us to capture dynamic activities like playing tennis, fencing, volleyball, etc. Furthermore, our multi-view setup generates accurate 3D ground truth even under severe or complete occlusion. The dataset consists of more than 125k egocentric images, spanning diverse scenes with a particular focus on challenging and unchoreographed multi-human activities and fast-moving egocentric views. We rigorously evaluate existing state-of-the-art methods and highlight their limitations in the egocentric scenario, specifically on multi-human tracking. To address such limitations, we propose EgoFormer, a novel approach with a multi-stream transformer architecture and explicit 3D spatial reasoning to estimate and track the human pose. EgoFormer significantly outperforms prior art by 13.6% IDF1 on the EgoHumans dataset.

Submitted 18 August, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: Accepted to ICCV 2023 (Oral)

arXiv:2304.08700 [pdf]

doi 10.1016/j.physletb.2023.138095

Multipole Expansion for the Electron-Nucleus Scattering at High Energies in the Unified Electroweak Theory

Authors: Z. P. Luong, M. T. Vo

Abstract: The paper presents the multipole expansion for the electron-nucleus scattering cross section at high energies within the framework of the unified electroweak theory. The electroweak currents of the nucleus are expanded into the simple components with definite angular momenta, called the multipole form factors. The multipole expansion of the cross section is a consequence of the above expansion. Besides the familiar electromagnetic form factors, there are also the vector and axial form factors, respectively, related to weak interactions. To determine multipole form factors, general formulas for the calculation of reduced matrix elements are established using the fractional parentage coefficient method and the multiparticle shell model. Calculation of them enables us to obtain more detailed information about the nuclear structure and elucidate the role played by the weak interaction in the high-energy reaction mechanisms.

Submitted 5 August, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

Comments: 8 pages, 2 figures

arXiv:2304.06602 [pdf, other]

A-CAP: Anticipation Captioning with Commonsense Knowledge

Authors: Duc Minh Vo, Quoc-An Luong, Akihiro Sugimoto, Hideki Nakayama

Abstract: Humans possess the capacity to reason about the future based on a sparse collection of visual cues acquired over time. In order to emulate this ability, we introduce a novel task called Anticipation Captioning, which generates a caption for an unseen oracle image using a sparsely temporally-ordered set of images. To tackle this new task, we propose a model called A-CAP, which incorporates commonsense knowledge into a pre-trained vision-language model, allowing it to anticipate the caption. Through both qualitative and quantitative evaluations on a customized visual storytelling dataset, A-CAP outperforms other image captioning methods and establishes a strong baseline for anticipation captioning. We also address the challenges inherent in this task.

Submitted 13 April, 2023; originally announced April 2023.

Comments: Accepted to CVPR 2023

arXiv:2303.10229 [pdf, other]

Distinct Distances in $R^3$ Between Quadratic and Orthogonal Curves

Authors: Toby Aldape, Jingyi Liu, Gregory Pylypovych, Adam Sheffer, Minh-Quan Vo

Abstract: We study the minimum number of distinct distances between point sets on two curves in $R^3$. Assume that one curve contains $m$ points and the other $n$ points. Our main results: (a) When the curves are conic sections, we characterize all cases where the number of distances is $O(m+n)$. This includes new constructions for points on two parabolas, two ellipses, and one ellipse and one hyperbola. In all other cases, the number of distances is $Ω(\min\{m^{2/3}n^{2/3},m^2,n^2\})$. (b) When the curves are not necessarily algebraic but smooth and contained in perpendicular planes, we characterize all cases where the number of distances is $O(m+n)$. This includes a surprising new construction of non-algebraic curves that involve logarithms. In all other cases, the number of distances is $Ω(\min\{m^{2/3}n^{2/3},m^2,n^2\})$.

Submitted 17 March, 2023; originally announced March 2023.

arXiv:2211.13470 [pdf, other]

Efficient Zero-shot Visual Search via Target and Context-aware Transformer

Authors: Zhiwei Ding, Xuezhe Ren, Erwan David, Melissa Vo, Gabriel Kreiman, Mengmi Zhang

Abstract: Visual search is a ubiquitous challenge in natural vision, including daily tasks such as finding a friend in a crowd or searching for a car in a parking lot. Humans rely heavily on relevant target features to perform goal-directed visual search. Meanwhile, context is of critical importance for locating a target object in complex scenes as it helps narrow down the search area and makes the search process more efficient. However, few works have combined both target and context information in visual search computational models. Here we propose a zero-shot deep learning architecture, TCT (Target and Context-aware Transformer), that modulates self attention in the Vision Transformer with target and contextual relevant information to enable human-like zero-shot visual search performance. Target modulation is computed as patch-wise local relevance between the target and search images, whereas contextual modulation is applied in a global fashion. We conduct visual search experiments on TCT and other competitive visual search models on three natural scene datasets with varying levels of difficulty. TCT demonstrates human-like performance in terms of search efficiency and beats the SOTA models in challenging visual search tasks. Importantly, TCT generalizes well across datasets with novel objects without retraining or fine-tuning. Furthermore, we also introduce a new dataset to benchmark models for invariant visual search under incongruent contexts. TCT manages to search flexibly via target and context modulation, even under incongruent contexts.

Submitted 24 November, 2022; originally announced November 2022.

arXiv:2210.08459 [pdf, other]

StoryER: Automatic Story Evaluation via Ranking, Rating and Reasoning

Authors: Hong Chen, Duc Minh Vo, Hiroya Takamura, Yusuke Miyao, Hideki Nakayama

Abstract: Existing automatic story evaluation methods place a premium on story lexical level coherence, deviating from human preference. We go beyond this limitation by considering a novel Story Evaluation method that mimics human preference when judging a story, namely StoryER, which consists of three sub-tasks: Ranking, Rating and Reasoning. Given either a machine-generated or a human-written story, StoryER requires the machine to output 1) a preference score that corresponds to human preference, 2) specific ratings and their corresponding confidences and 3) comments for various aspects (e.g., opening, character-shaping). To support these tasks, we introduce a well-annotated dataset comprising (i) 100k ranked story pairs; and (ii) a set of 46k ratings and comments on various aspects of the story. We finetune Longformer-Encoder-Decoder (LED) on the collected dataset, with the encoder responsible for preference score and aspect prediction and the decoder for comment generation. Our comprehensive experiments result in a competitive benchmark for each task, showing the high correlation to human preference. In addition, we have witnessed the joint learning of the preference scores, the aspect ratings, and the comments brings gain in each single task. Our dataset and benchmarks are publicly available to advance the research of story evaluation tasks. (Dataset and pre-trained model demo are available at anonymous website http://storytelling-lab.com/eval and https://github.com/sairin1202/StoryER.)

Submitted 21 October, 2022; v1 submitted 16 October, 2022; originally announced October 2022.

Comments: accepted by EMNLP 2022

arXiv:2209.07629 [pdf, other]

Self-Relation Attention and Temporal Awareness for Emotion Recognition via Vocal Burst

Authors: Dang-Linh Trinh, Minh-Cong Vo, Guee-Sang Lee

Abstract: The technical report presents our emotion recognition pipeline for high-dimensional emotion task (A-VB High) in The ACII Affective Vocal Bursts (A-VB) 2022 Workshop & Competition. Our proposed method contains three stages. Firstly, we extract the latent features from the raw audio signal and its Mel-spectrogram by self-supervised learning methods. Then, the features from the raw signal are fed to the self-relation attention and temporal awareness (SA-TA) module for learning the valuable information between these latent features. Finally, we concatenate all the features and utilize a fully-connected layer to predict each emotion's score. By empirical experiments, our proposed method achieves a mean concordance correlation coefficient (CCC) of 0.7295 on the test set, compared to 0.5686 on the baseline model. The code of our method is available at https://github.com/linhtd812/A-VB2022.

Submitted 26 September, 2022; v1 submitted 15 September, 2022; originally announced September 2022.
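
The abstract above reports results as a mean concordance correlation coefficient (CCC). For readers unfamiliar with the metric, here is a minimal sketch of the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population variances. The function name and the toy values are illustrative only and are not taken from the paper's code; the reported figure is an average of such scores, presumably over the emotion dimensions of the task.

```python
def concordance_cc(pred, target):
    """Concordance correlation coefficient between two equal-length sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    # Population (divide-by-n) variances and covariance.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2 * cov / denom


if __name__ == "__main__":
    scores = [0.1, 0.5, 0.9]                              # hypothetical emotion scores
    print(concordance_cc(scores, scores))                 # perfect agreement
    print(concordance_cc(scores, [s + 0.5 for s in scores]))  # same shape, shifted
```

Unlike Pearson correlation, CCC penalises a constant offset between predictions and targets, which is why the shifted example scores below the perfect-agreement one.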

arXiv:2209.04794 [pdf, other]

doi 10.1371/journal.pone.0276545

Learning to diagnose common thorax diseases on chest radiographs from radiology reports in Vietnamese

Authors: Thao T. B. Nguyen, Tam M. Vo, Thang V. Nguyen, Hieu H. Pham, Ha Q. Nguyen

Abstract: We propose a data collecting and annotation pipeline that extracts information from Vietnamese radiology reports to provide accurate labels for chest X-ray (CXR) images. This can benefit Vietnamese radiologists and clinicians by annotating data that closely match their endemic diagnosis categories which may vary from country to country. To assess the efficacy of the proposed labeling technique, we built a CXR dataset containing 9,752 studies and evaluated our pipeline using a subset of this dataset. With an F1-score of at least 0.9923, the evaluation demonstrates that our labeling tool performs precisely and consistently across all classes. After building the dataset, we train deep learning models that leverage knowledge transferred from large public CXR datasets. We employ a variety of loss functions to overcome the curse of imbalanced multi-label datasets and conduct experiments with various model architectures to select the one that delivers the best performance. Our best model (CheXpert-pretrained EfficientNet-B2) yields an F1-score of 0.6989 (95% CI 0.6740, 0.7240), AUC of 0.7912, sensitivity of 0.7064 and specificity of 0.8760 for the abnormal diagnosis in general. Finally, we demonstrate that our coarse classification (based on five specific locations of abnormalities) yields comparable results to fine classification (twelve pathologies) on the benchmark CheXpert dataset for general anomaly detection while delivering better performance in terms of the average performance of all classes.

Submitted 11 September, 2022; originally announced September 2022.

Comments: This work has been provisionally accepted for publication by PLOS ONE

arXiv:2207.05249 [pdf, other]

Efficient Human Vision Inspired Action Recognition using Adaptive Spatiotemporal Sampling

Authors: Khoi-Nguyen C. Mac, Minh N. Do, Minh P. Vo

Abstract: Adaptive sampling that exploits the spatiotemporal redundancy in videos is critical for always-on action recognition on wearable devices with limited computing and battery resources. The commonly used fixed sampling strategy is not context-aware and may under-sample the visual content, and thus adversely impacts both computation efficiency and accuracy. Inspired by the concepts of foveal vision and pre-attentive processing from the human visual perception mechanism, we introduce a novel adaptive spatiotemporal sampling scheme for efficient action recognition. Our system pre-scans the global scene context at low-resolution and decides to skip or request high-resolution features at salient regions for further processing. We validate the system on EPIC-KITCHENS and UCF-101 datasets for action recognition, and show that our proposed approach can greatly speed up inference with a tolerable loss of accuracy compared with those from state-of-the-art baselines. Source code is available in https://github.com/knmac/adaptive_spatiotemporal.

Submitted 14 July, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

arXiv:2207.04320 [pdf, other]

Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation Tracking and Forecasting on a Video Snippet

Authors: Shihao Zou, Yuanlu Xu, Chao Li, Lingni Ma, Li Cheng, Minh Vo

Abstract: Multi-person pose understanding from RGB videos involves three complex tasks: pose estimation, tracking and motion forecasting. Intuitively, accurate multi-person pose estimation facilitates robust tracking, and robust tracking builds crucial history for correct motion forecasting. Most existing works either focus on a single task or employ multi-stage approaches to solving multiple tasks separately, which tend to make sub-optimal decisions at each stage and also fail to exploit correlations among the three tasks. In this paper, we propose Snipper, a unified framework to perform multi-person 3D pose estimation, tracking, and motion forecasting simultaneously in a single stage. We propose an efficient yet powerful deformable attention mechanism to aggregate spatiotemporal information from the video snippet. Building upon this deformable attention, a video transformer is learned to encode the spatiotemporal features from the multi-frame snippet and to decode informative pose features for multi-person pose queries. Finally, these pose queries are regressed to predict multi-person pose trajectories and future motions in a single shot. In the experiments, we show the effectiveness of Snipper on three challenging public datasets where our generic model rivals specialized state-of-art baselines for pose estimation, tracking, and forecasting.

Submitted 12 September, 2023; v1 submitted 9 July, 2022; originally announced July 2022.

arXiv:2206.08929 [pdf, other]

TAVA: Template-free Animatable Volumetric Actors

Authors: Ruilong Li, Julian Tanke, Minh Vo, Michael Zollhofer, Jurgen Gall, Angjoo Kanazawa, Christoph Lassner

Abstract: Coordinate-based volumetric representations have the potential to generate photo-realistic virtual avatars from images. However, virtual avatars also need to be controllable even to a novel pose that may not have been observed. Traditional techniques, such as LBS, provide such a function; yet it usually requires a hand-designed body template, 3D scan data, and limited appearance models. On the other hand, neural representation has been shown to be powerful in representing visual details, but is under-explored on deforming dynamic articulated actors. In this paper, we propose TAVA, a method to create Template-free Animatable Volumetric Actors, based on neural representations. We rely solely on multi-view data and a tracked skeleton to create a volumetric model of an actor, which can be animated at the test time given novel pose. Since TAVA does not require a body template, it is applicable to humans as well as other creatures such as animals. Furthermore, TAVA is designed such that it can recover accurate dense correspondences, making it amenable to content-creation and editing tasks. Through extensive experiments, we demonstrate that the proposed method generalizes well to novel poses as well as unseen views and showcase basic editing capabilities.

Submitted 20 June, 2022; v1 submitted 17 June, 2022; originally announced June 2022.

Comments: Code: https://github.com/facebookresearch/tava; Project Website: https://www.liruilong.cn/projects/tava/

arXiv:2204.14249 [pdf, other]

OSSGAN: Open-Set Semi-Supervised Image Generation

Authors: Kai Katsumata, Duc Minh Vo, Hideki Nakayama

Abstract: We introduce a challenging training scheme of conditional GANs, called open-set semi-supervised image generation, where the training dataset consists of two parts: (i) labeled data and (ii) unlabeled data with samples belonging to one of the labeled data classes, namely, a closed-set, and samples not belonging to any of the labeled data classes, namely, an open-set. Unlike the existing semi-supervised image generation task, where unlabeled data only contain closed-set samples, our task is more general and lowers the data collection cost in practice by allowing open-set samples to appear. Thanks to entropy regularization, the classifier that is trained on labeled data is able to quantify sample-wise importance to the training of cGAN as confidence, allowing us to use all samples in unlabeled data. We design OSSGAN, which provides decision clues to the discriminator on the basis of whether an unlabeled image belongs to one or none of the classes of interest, smoothly integrating labeled and unlabeled data during training. The results of experiments on Tiny ImageNet and ImageNet show notable improvements over supervised BigGAN and semi-supervised methods. Our code is available at https://github.com/raven38/OSSGAN.

Submitted 29 April, 2022; originally announced April 2022.

Comments: Accepted at CVPR 2022

arXiv:2204.01695 [pdf, other]

LISA: Learning Implicit Shape and Appearance of Hands

Authors: Enric Corona, Tomas Hodan, Minh Vo, Francesc Moreno-Noguer, Chris Sweeney, Richard Newcombe, Lingni Ma

Abstract: This paper proposes a do-it-all neural model of human hands, named LISA. The model can capture accurate hand shape and appearance, generalize to arbitrary hand subjects, provide dense surface correspondences, be reconstructed from images in the wild and easily animated. We train LISA by minimizing the shape and appearance losses on a large set of multi-view RGB image sequences annotated with coarse 3D poses of the hand skeleton. For a 3D point in the hand local coordinate, our model predicts the color and the signed distance with respect to each hand bone independently, and then combines the per-bone predictions using predicted skinning weights. The shape, color and pose representations are disentangled by design, allowing to estimate or animate only selected parameters. We experimentally demonstrate that LISA can accurately reconstruct a dynamic hand from monocular or multi-view sequences, achieving a noticeably higher quality of reconstructed hand shapes compared to baseline approaches. Project page: https://www.iri.upc.edu/people/ecorona/lisa/.

Submitted 4 April, 2022; originally announced April 2022.

Comments: Published at CVPR 2022

arXiv:2203.14499 [pdf, other]

NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge

Authors: Duc Minh Vo, Hong Chen, Akihiro Sugimoto, Hideki Nakayama

Abstract: Novel object captioning aims at describing objects absent from training data, with the key ingredient being the provision of object vocabulary to the model. Although existing methods heavily rely on an object detection model, we view the detection step as vocabulary retrieval from an external knowledge in the form of embeddings for any object's definition from Wiktionary, where we use in the retrieval image region features learned from a transformers model. We propose an end-to-end Novel Object Captioning with Retrieved vocabulary from External Knowledge method (NOC-REK), which simultaneously learns vocabulary retrieval and caption generation, successfully describing novel objects outside of the training dataset. Furthermore, our model eliminates the requirement for model retraining by simply updating the external knowledge whenever a novel object appears. Our comprehensive experiments on held-out COCO and Nocaps datasets show that our NOC-REK is considerably effective against SOTAs.

Submitted 28 March, 2022; originally announced March 2022.

Comments: Accepted at CVPR 2022

arXiv:2203.08456 [pdf, other]

PPCD-GAN: Progressive Pruning and Class-Aware Distillation for Large-Scale Conditional GANs Compression

Authors: Duc Minh Vo, Akihiro Sugimoto, Hideki Nakayama

Abstract: We push forward neural network compression research by exploiting a novel challenging task of large-scale conditional generative adversarial networks (GANs) compression. To this end, we propose a gradually shrinking GAN (PPCD-GAN) by introducing progressive pruning residual block (PP-Res) and class-aware distillation. The PP-Res is an extension of the conventional residual block where each convolutional layer is followed by a learnable mask layer to progressively prune network parameters as training proceeds. The class-aware distillation, on the other hand, enhances the stability of training by transferring immense knowledge from a well-trained teacher model through instructive attention maps. We train the pruning and distillation processes simultaneously on a well-known GAN architecture in an end-to-end manner. After training, all redundant parameters as well as the mask layers are discarded, yielding a lighter network while retaining the performance. We comprehensively illustrate, on ImageNet 128x128 dataset, PPCD-GAN reduces up to 5.2x (81%) parameters against state-of-the-arts while keeping better performance.

Submitted 16 March, 2022; originally announced March 2022.

Comments: Accepted at WACV 2022

arXiv:2202.10753 [pdf, other]

Convolutional Neural Network Modelling for MODIS Land Surface Temperature Super-Resolution

Authors: Binh Minh Nguyen, Ganglin Tian, Minh-Triet Vo, Aurélie Michel, Thomas Corpetti, Carlos Granero-Belinchon

Abstract: Nowadays, thermal infrared satellite remote sensors enable to extract very interesting information at large scale, in particular Land Surface Temperature (LST). However such data are limited in spatial and/or temporal resolutions which prevents from an analysis at fine scales. For example, MODIS satellite provides daily acquisitions with 1Km spatial resolutions which is not sufficient to deal with highly heterogeneous environments as agricultural parcels. Therefore, image super-resolution is a crucial task to better exploit MODIS LSTs. This issue is tackled in this paper. We introduce a deep learning-based algorithm, named Multi-residual U-Net, for super-resolution of MODIS LST single-images. Our proposed network is a modified version of U-Net architecture, which aims at super-resolving the input LST image from 1Km to 250m per pixel. The results show that our Multi-residual U-Net outperforms other state-of-the-art methods.

Submitted 1 April, 2022; v1 submitted 22 February, 2022; originally announced February 2022.

arXiv:2112.12761 [pdf, other]

BANMo: Building Animatable 3D Neural Models from Many Casual Videos

Authors: Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi, Hanbyul Joo

Abstract: Prior work for articulated 3D shape reconstruction often relies on specialized sensors (e.g., synchronized multi-camera systems), or pre-built 3D deformable models (e.g., SMAL or SMPL). Such methods are not able to scale to diverse sets of objects in the wild. We present BANMo, a method that requires neither a specialized sensor nor a pre-defined template shape. BANMo builds high-fidelity, articulated 3D models (including shape and animatable skinning weights) from many monocular casual videos in a differentiable rendering framework. While the use of many videos provides more coverage of camera views and object articulations, they introduce significant challenges in establishing correspondence across scenes with different backgrounds, illumination conditions, etc. Our key insight is to merge three schools of thought: (1) classic deformable shape models that make use of articulated bones and blend skinning, (2) volumetric neural radiance fields (NeRFs) that are amenable to gradient-based optimization, and (3) canonical embeddings that generate correspondences between pixels and an articulated model. We introduce neural blend skinning models that allow for differentiable and invertible articulated deformations. When combined with canonical embeddings, such models allow us to establish dense correspondences across videos that can be self-supervised with cycle consistency. On real and synthetic datasets, BANMo shows higher-fidelity 3D reconstructions than prior works for humans and animals, with the ability to render realistic images from novel viewpoints and poses. Project webpage: banmo-www.github.io.

Submitted 3 April, 2023; v1 submitted 23 December, 2021; originally announced December 2021.

Comments: CVPR 2022 camera-ready version (last update: May 2022)

arXiv:2110.07058 [pdf, other]

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/

Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

arXiv:2108.10165 [pdf, other]

ODAM: Object Detection, Association, and Mapping using Posed RGB Video

Authors: Kejie Li, Daniel DeTone, Steven Chen, Minh Vo, Ian Reid, Hamid Rezatofighi, Chris Sweeney, Julian Straub, Richard Newcombe

Abstract: Localizing objects and estimating their extent in 3D is an important step towards high-level 3D scene understanding, which has many applications in Augmented Reality and Robotics. We present ODAM, a system for 3D Object Detection, Association, and Mapping using posed RGB videos. The proposed system relies on a deep learning front-end to detect 3D objects from a given RGB frame and associate them to a global object-based map using a graph neural network (GNN). Based on these frame-to-model associations, our back-end optimizes object bounding volumes, represented as super-quadrics, under multi-view geometry constraints and the object scale prior. We validate the proposed system on ScanNet where we show a significant improvement over existing RGB-only methods.

Submitted 23 August, 2021; originally announced August 2021.

Comments: Accepted in ICCV 2021 as oral

arXiv:2104.07267 [pdf, other]

ContactOpt: Optimizing Contact to Improve Grasps

Authors: Patrick Grady, Chengcheng Tang, Christopher D. Twigg, Minh Vo, Samarth Brahmbhatt, Charles C. Kemp

Abstract: Physical contact between hands and objects plays a critical role in human grasps. We show that optimizing the pose of a hand to achieve expected contact with an object can improve hand poses inferred via image-based methods. Given a hand mesh and an object mesh, a deep model trained on ground truth contact data infers desirable contact across the surfaces of the meshes. Then, ContactOpt efficiently optimizes the pose of the hand to achieve desirable contact using a differentiable contact model. Notably, our contact model encourages mesh interpenetration to approximate deformable soft tissue in the hand. In our evaluations, our methods result in grasps that better match ground truth contact, have lower kinematic error, and are significantly preferred by human participants. Code and models are available online.

Submitted 15 April, 2021; originally announced April 2021.

Comments: Conference on Computer Vision and Pattern Recognition (CVPR) 2021

arXiv:2012.12890 [pdf, other]

ANR: Articulated Neural Rendering for Virtual Avatars

Authors: Amit Raj, Julian Tanke, James Hays, Minh Vo, Carsten Stoll, Christoph Lassner

Abstract: The combination of traditional rendering with neural networks in Deferred Neural Rendering (DNR) provides a compelling balance between computational complexity and realism of the resulting images. Using skinned meshes for rendering articulating objects is a natural extension for the DNR framework and would open it up to a plethora of applications. However, in this case the neural shading step must account for deformations that are possibly not captured in the mesh, as well as alignment inaccuracies and dynamics -- which can confound the DNR pipeline. We present Articulated Neural Rendering (ANR), a novel framework based on DNR which explicitly addresses its limitations for virtual human avatars. We show the superiority of ANR not only with respect to DNR but also with methods specialized for avatar creation and animation. In two user studies, we observe a clear preference for our avatar model and we demonstrate state-of-the-art performance on quantitative evaluation metrics. Perceptually, we observe better temporal stability, level of detail and plausibility.

Submitted 23 December, 2020; originally announced December 2020.

arXiv:2008.00158 [pdf, ps, other]

TexMesh: Reconstructing Detailed Human Texture and Geometry from RGB-D Video

Authors: Tiancheng Zhi, Christoph Lassner, Tony Tung, Carsten Stoll, Srinivasa G. Narasimhan, Minh Vo

Abstract: We present TexMesh, a novel approach to reconstruct detailed human meshes with high-resolution full-body texture from RGB-D video. TexMesh enables high quality free-viewpoint rendering of humans. Given the RGB frames, the captured environment map, and the coarse per-frame human mesh from RGB-D tracking, our method reconstructs spatiotemporally consistent and detailed per-frame meshes along with a high-resolution albedo texture. By using the incident illumination we are able to accurately estimate local surface geometry and albedo, which allows us to further use photometric constraints to adapt a synthetically trained model to real-world sequences in a self-supervised manner for detailed surface geometry and high-resolution texture estimation. In practice, we train our models on a short example sequence for self-adaptation and the model runs at interactive framerate afterwards. We validate TexMesh on synthetic and real-world data, and show it outperforms the state of art quantitatively and qualitatively.

Submitted 20 September, 2020; v1 submitted 31 July, 2020; originally announced August 2020.

Comments: ECCV 2020

arXiv:2007.12806 [pdf, other]

Spatiotemporal Bundle Adjustment for Dynamic 3D Human Reconstruction in the Wild

Authors: Minh Vo, Yaser Sheikh, Srinivasa G. Narasimhan

Abstract: Bundle adjustment jointly optimizes camera intrinsics and extrinsics and 3D point triangulation to reconstruct a static scene. The triangulation constraint, however, is invalid for moving points captured in multiple unsynchronized videos and bundle adjustment is not designed to estimate the temporal alignment between cameras. We present a spatiotemporal bundle adjustment framework that jointly optimizes four coupled sub-problems: estimating camera intrinsics and extrinsics, triangulating static 3D points, as well as sub-frame temporal alignment between cameras and computing 3D trajectories of dynamic points. Key to our joint optimization is the careful integration of physics-based motion priors within the reconstruction pipeline, validated on a large motion capture corpus of human subjects. We devise an incremental reconstruction and alignment algorithm to strictly enforce the motion prior during the spatiotemporal bundle adjustment. This algorithm is further made more efficient by a divide and conquer scheme while still maintaining high accuracy. We apply this algorithm to reconstruct 3D motion trajectories of human bodies in dynamic events captured by multiple uncalibrated and unsynchronized video cameras in the wild. To make the reconstruction visually more interpretable, we fit a statistical 3D human body model to the asynchronous video streams. Compared to the baseline, the fitting significantly benefits from the proposed spatiotemporal bundle adjustment procedure. Because the videos are aligned with sub-frame precision, we reconstruct 3D motion at much higher temporal resolution than the input videos.

Submitted 24 July, 2020; originally announced July 2020.

Comments: Accepted to IEEE TPAMI

arXiv:2007.03672 [pdf, other]

Long-term Human Motion Prediction with Scene Context

Authors: Zhe Cao, Hang Gao, Karttikeya Mangalam, Qi-Zhi Cai, Minh Vo, Jitendra Malik

Abstract: Human movement is goal-directed and influenced by the spatial layout of the objects in the scene. To plan future human motion, it is crucial to perceive the environment -- imagine how hard it is to navigate a new room with lights off. Existing works on predicting human motion do not pay attention to the scene context and thus struggle in long-term prediction. In this work, we propose a novel three-stage framework that exploits scene context to tackle this task. Given a single scene image and 2D pose histories, our method first samples multiple human motion goals, then plans 3D human paths towards each goal, and finally predicts 3D human pose sequences following each path. For stable training and rigorous evaluation, we contribute a diverse synthetic dataset with clean annotations. In both synthetic and real datasets, our method shows consistent quantitative and qualitative improvements over existing methods.

Submitted 31 July, 2020; v1 submitted 7 July, 2020; originally announced July 2020.

Comments: ECCV 2020 Oral. Dataset & Code: https://github.com/ZheC/GTA-IM-Dataset Video: https://people.eecs.berkeley.edu/~zhecao/hmp/index.html

arXiv:2005.13532 [pdf, other]

4D Visualization of Dynamic Events from Unconstrained Multi-View Videos

Authors: Aayush Bansal, Minh Vo, Yaser Sheikh, Deva Ramanan, Srinivasa Narasimhan

Abstract: We present a data-driven approach for 4D space-time visualization of dynamic events from videos captured by hand-held multiple cameras. Key to our approach is the use of self-supervised neural networks specific to the scene to compose static and dynamic aspects of an event. Though captured from discrete viewpoints, this model enables us to move around the space-time of the event continuously. This model allows us to create virtual cameras that facilitate: (1) freezing the time and exploring views; (2) freezing a view and moving through time; and (3) simultaneously changing both time and view. We can also edit the videos and reveal occluded objects for a given view if it is visible in any of the other views. We validate our approach on challenging in-the-wild events captured using up to 15 mobile cameras.

Submitted 27 May, 2020; originally announced May 2020.

Comments: Project Page - http://www.cs.cmu.edu/~aayushb/Open4D/

arXiv:1911.08079 [pdf, other]

doi 10.1007/s00138-020-01086-1

Two-Stream FCNs to Balance Content and Style for Style Transfer

Authors: Duc Minh Vo, Akihiro Sugimoto

Abstract: Style transfer is to render given image contents in given styles, and it has an important role in both computer vision fundamental research and industrial applications. Following the success of deep learning based approaches, this problem has been re-launched recently, but still remains a difficult task because of trade-off between preserving contents and faithful rendering of styles. Indeed, how well-balanced content and style are is crucial in evaluating the quality of stylized images. In this paper, we propose an end-to-end two-stream Fully Convolutional Networks (FCNs) aiming at balancing the contributions of the content and the style in rendered images. Our proposed network consists of the encoder and decoder parts. The encoder part utilizes a FCN for content and a FCN for style where the two FCNs have feature injections and are independently trained to preserve the semantic content and to learn the faithful style representation in each. The semantic content feature and the style representation feature are then concatenated adaptively and fed into the decoder to generate style-transferred (stylized) images. In order to train our proposed network, we employ a loss network, the pre-trained VGG-16, to compute content loss and style loss, both of which are efficiently used for the feature injection as well as the feature concatenation. Our intensive experiments show that our proposed model generates more balanced stylized images in content and style than state-of-the-art methods. Moreover, our proposed network achieves efficiency in speed.

Submitted 7 May, 2020; v1 submitted 18 November, 2019; originally announced November 2019.

Comments: published in Machine Vision and Applications

arXiv:1908.01741 [pdf, other]

Visual-Relation Conscious Image Generation from Structured-Text

Authors: Duc Minh Vo, Akihiro Sugimoto

Abstract: We propose an end-to-end network for image generation from given structured-text that consists of the visual-relation layout module and the pyramid of GANs, namely stacking-GANs. Our visual-relation layout module uses relations among entities in the structured-text in two ways: comprehensive usage and individual usage. We comprehensively use all available relations together to localize initial bounding-boxes of all the entities. We also use individual relation separately to predict from the initial bounding-boxes relation-units for all the relations in the input text. We then unify all the relation-units to produce the visual-relation layout, i.e., bounding-boxes for all the entities so that each of them uniquely corresponds to each entity while keeping its involved relations. Our visual-relation layout reflects the scene structure given in the input text. The stacking-GANs is the stack of three GANs conditioned on the visual-relation layout and the output of previous GAN, consistently capturing the scene structure. Our network realistically renders entities' details in high resolution while keeping the scene structure. Experimental results on two public datasets show outperformances of our method against state-of-the-art methods.

Submitted 18 July, 2020; v1 submitted 5 August, 2019; originally announced August 2019.

Comments: accepted at ECCV 2020

arXiv:1805.08717 [pdf, other]

doi 10.1109/TPAMI.2020.2974726

Self-supervised Multi-view Person Association and Its Applications

Authors: Minh Vo, Ersin Yumer, Kalyan Sunkavalli, Sunil Hadap, Yaser Sheikh, Srinivasa Narasimhan

Abstract: Reliable markerless motion tracking of people participating in a complex group activity from multiple moving cameras is challenging due to frequent occlusions, strong viewpoint and appearance variations, and asynchronous video streams. To solve this problem, reliable association of the same person across distant viewpoints and temporal instances is essential. We present a self-supervised framework to adapt a generic person appearance descriptor to the unlabeled videos by exploiting motion tracking, mutual exclusion constraints, and multi-view geometry. The adapted discriminative descriptor is used in a tracking-by-clustering formulation. We validate the effectiveness of our descriptor learning on WILDTRACK [14] and three new complex social scenes captured by multiple cameras with up to 60 people "in the wild". We report significant improvement in association accuracy (up to 18%) and stable and coherent 3D human skeleton tracking (5 to 10 times) over the baseline. Using the reconstructed 3D skeletons, we cut the input videos into a multi-angle video where the image of a specified person is shown from the best visible front-facing camera. Our algorithm detects inter-human occlusion to determine the camera switching moment while still maintaining the flow of the action well.

Submitted 18 April, 2020; v1 submitted 22 May, 2018; originally announced May 2018.

Comments: Accepted to IEEE TPAMI

arXiv:1704.05314 [pdf, ps, other]

The local backward heat problem

Authors: Thi Minh Nhat Vo

Abstract: In this paper, we study the local backward problem of a linear heat equation with time-dependent coefficients under the Dirichlet boundary condition. Precisely, we recover the initial data from the observation on a subdomain at some later time. Thanks to the "optimal filtering" method of Seidman, we can solve the global backward problem, which determines the solution at initial time from the known data on the whole domain. Then, by using a result of controllability at one point of time, we can connect local and global backward problem.

Submitted 18 April, 2017; originally announced April 2017.

arXiv:hep-ex/0104012 [pdf, ps, other]

doi 10.1016/S0168-9002(01)01371-7

The Drift Chambers Of The Nomad Experiment

Authors: M. Anfreville, P. Astier, M. Authier, A. Baldisseri, M. Banner, N. Besson, J. Bouchez, A. Castera, O. Cloue, J. Dumarchez, L. Dumps, E. Gangler, J. Gosset, C. Hagner, C. Jollec, C. Lachaud, A. Letessier, J. M. Levy, L. Linssen, J. P. Meyer, J. P. Ouriet, J. P. Passerieux, T. Pedrol, A. Placci, J. Poinsignon, et al. (8 additional authors not shown)

Abstract: We present a detailed description of the drift chambers used as an active target and a tracking device in the NOMAD experiment at CERN. The main characteristics of these chambers are a large area, a self supporting structure made of light composite materials and a low cost. A spatial resolution of 150 microns has been achieved with a single hit efficiency of 97%.

Submitted 8 June, 2001; v1 submitted 9 April, 2001; originally announced April 2001.

Comments: 42 pages, 26 figures

Report number: LPNHE-01-01

Journal ref: Nucl.Instrum.Meth. A481 (2002) 339-364

Showing 1–40 of 40 results for author: Vo, M