Search | arXiv e-print repository

LLM-Powered Explanations: Unraveling Recommendations Through Subgraph Reasoning

Authors: Guangsi Shi, Xiaofeng Deng, Linhao Luo, Lijuan Xia, Lei Bao, Bei Ye, Fei Du, Shirui Pan, Yuxiao Li

Abstract: Recommender systems are pivotal in enhancing user experiences across various web applications by analyzing the complicated relationships between users and items. Knowledge graphs(KGs) have been widely used to enhance the performance of recommender systems. However, KGs are known to be noisy and incomplete, which are hard to provide reliable explanations for recommendation results. An explainable r… ▽ More Recommender systems are pivotal in enhancing user experiences across various web applications by analyzing the complicated relationships between users and items. Knowledge graphs(KGs) have been widely used to enhance the performance of recommender systems. However, KGs are known to be noisy and incomplete, which are hard to provide reliable explanations for recommendation results. An explainable recommender system is crucial for the product development and subsequent decision-making. To address these challenges, we introduce a novel recommender that synergies Large Language Models (LLMs) and KGs to enhance the recommendation and provide interpretable results. Specifically, we first harness the power of LLMs to augment KG reconstruction. LLMs comprehend and decompose user reviews into new triples that are added into KG. In this way, we can enrich KGs with explainable paths that express user preferences. To enhance the recommendation on augmented KGs, we introduce a novel subgraph reasoning module that effectively measures the importance of nodes and discovers reasoning for recommendation. Finally, these reasoning paths are fed into the LLMs to generate interpretable explanations of the recommendation results. Our approach significantly enhances both the effectiveness and interpretability of recommender systems, especially in cross-selling scenarios where traditional methods falter. The effectiveness of our approach has been rigorously tested on four open real-world datasets, with our methods demonstrating a superior performance over contemporary state-of-the-art techniques by an average improvement of 12%. The application of our model in a multinational engineering and technology company cross-selling recommendation system further underscores its practical utility and potential to redefine recommendation practices through improved accuracy and user trust. △ Less

Submitted 29 June, 2024; v1 submitted 22 June, 2024; originally announced June 2024.

arXiv:2406.07006 [pdf, other]

MIPI 2024 Challenge on Few-shot RAW Image Denoising: Methods and Results

Authors: Xin **, Chunle Guo, Xiaoming Li, Zongsheng Yue, Chongyi Li, Shangchen Zhou, Ruicheng Feng, Yuekun Dai, Peiqing Yang, Chen Change Loy, Ruoqi Li, Chang Liu, Ziyi Wang, Yao Du, **g**g Yang, Long Bao, Heng Sun, Xiangyu Kong, Xiaoxia Xing, **long Wu, Yuanyang Xue, Hyunhee Park, Sejun Song, Changho Kim, **gfan Tan , et al. (17 additional authors not shown)

Abstract: The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photogra… ▽ More The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Few-shot RAW Image Denoising track on MIPI 2024. In total, 165 participants were successfully registered, and 7 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art erformance on Few-shot RAW Image Denoising. More details of this challenge and the link to the dataset can be found at https://mipichallenge.org/MIPI2024. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: CVPR 2024 Mobile Intelligent Photography and Imaging (MIPI) Workshop--Few-shot RAWImage Denoising Challenge Report. Website: https://mipi-challenge.org/MIPI2024/

arXiv:2405.15705 [pdf, other]

Sums: Sniffing Unknown Multiband Signals under Low Sampling Rates

Authors: **bo Peng, Zhe Chen, Zheng Lin, Haoxuan Yuan, Zihan Fang, Lingzhong Bao, Zihang Song, Ying Li, **g Ren, Yue Gao

Abstract: Due to sophisticated deployments of all kinds of wireless networks (e.g., 5G, Wi-Fi, Bluetooth, LEO satellite, etc.), multiband signals distribute in a large bandwidth (e.g., from 70 MHz to 8 GHz). Consequently, for network monitoring and spectrum sharing applications, a sniffer for extracting physical layer information, such as structure of packet, with low sampling rate (especially, sub-Nyquist… ▽ More Due to sophisticated deployments of all kinds of wireless networks (e.g., 5G, Wi-Fi, Bluetooth, LEO satellite, etc.), multiband signals distribute in a large bandwidth (e.g., from 70 MHz to 8 GHz). Consequently, for network monitoring and spectrum sharing applications, a sniffer for extracting physical layer information, such as structure of packet, with low sampling rate (especially, sub-Nyquist sampling) can significantly improve their cost- and energy-efficiency. However, to achieve a multiband signals sniffer is really a challenge. To this end, we propose Sums, a system that can sniff and analyze multiband signals in a blind manner. Our Sums takes advantage of hardware and algorithm co-design, multi-coset sub-Nyquist sampling hardware, and a multi-task deep learning framework. The hardware component breaks the Nyquist rule to sample GHz bandwidth, but only pays for a 50 MSPS sampling rate. Our multi-task learning framework directly tackles the sampling data to perform spectrum sensing, physical layer protocol recognition, and demodulation for deep inspection from multiband signals. Extensive experiments demonstrate that Sums achieves higher accuracy than the state-of-theart baselines in spectrum sensing, modulation classification, and demodulation. As a result, our Sums can help researchers and end-users to diagnose or troubleshoot their problems of wireless infrastructures deployments in practice. △ Less

Submitted 24 May, 2024; originally announced May 2024.

Comments: 12 pages, 9 figures

arXiv:2405.07655 [pdf, other]

doi 10.1109/TIP.2024.3393365

Quality-aware Selective Fusion Network for V-D-T Salient Object Detection

Authors: Liuxin Bao, Xiaofei Zhou, Xiankai Lu, Yaoqi Sun, Haibing Yin, Zhenghui Hu, Jiyong Zhang, Chenggang Yan

Abstract: Depth images and thermal images contain the spatial geometry information and surface temperature information, which can act as complementary information for the RGB modality. However, the quality of the depth and thermal images is often unreliable in some challenging scenarios, which will result in the performance degradation of the two-modal based salient object detection (SOD). Meanwhile, some r… ▽ More Depth images and thermal images contain the spatial geometry information and surface temperature information, which can act as complementary information for the RGB modality. However, the quality of the depth and thermal images is often unreliable in some challenging scenarios, which will result in the performance degradation of the two-modal based salient object detection (SOD). Meanwhile, some researchers pay attention to the triple-modal SOD task, where they attempt to explore the complementarity of the RGB image, the depth image, and the thermal image. However, existing triple-modal SOD methods fail to perceive the quality of depth maps and thermal images, which leads to performance degradation when dealing with scenes with low-quality depth and thermal images. Therefore, we propose a quality-aware selective fusion network (QSF-Net) to conduct VDT salient object detection, which contains three subnets including the initial feature extraction subnet, the quality-aware region selection subnet, and the region-guided selective fusion subnet. Firstly, except for extracting features, the initial feature extraction subnet can generate a preliminary prediction map from each modality via a shrinkage pyramid architecture. Then, we design the weakly-supervised quality-aware region selection subnet to generate the quality-aware maps. Concretely, we first find the high-quality and low-quality regions by using the preliminary predictions, which further constitute the pseudo label that can be used to train this subnet. Finally, the region-guided selective fusion subnet purifies the initial features under the guidance of the quality-aware maps, and then fuses the triple-modal features and refines the edge details of prediction maps through the intra-modality and inter-modality attention (IIA) module and the edge refinement (ER) module, respectively. Extensive experiments are performed on VDT-2048 △ Less

Submitted 13 May, 2024; originally announced May 2024.

Comments: Accepted by IEEE Transactions on Image Processing (TIP)

arXiv:2405.03509 [pdf, other]

doi 10.1145/3660811

Are Human Rules Necessary? Generating Reusable APIs with CoT Reasoning and In-Context Learning

Authors: Yubo Mai, Zhipeng Gao, Xing Hu, Lingfeng Bao, Yu Liu, Jianling Sun

Abstract: Inspired by the great potential of Large Language Models (LLMs) for solving complex coding tasks, in this paper, we propose a novel approach, named Code2API, to automatically perform APIzation for Stack Overflow code snippets. Code2API does not require additional model training or any manual crafting rules and can be easily deployed on personal computers without relying on other external tools. Sp… ▽ More Inspired by the great potential of Large Language Models (LLMs) for solving complex coding tasks, in this paper, we propose a novel approach, named Code2API, to automatically perform APIzation for Stack Overflow code snippets. Code2API does not require additional model training or any manual crafting rules and can be easily deployed on personal computers without relying on other external tools. Specifically, Code2API guides the LLMs through well-designed prompts to generate well-formed APIs for given code snippets. To elicit knowledge and logical reasoning from LLMs, we used chain-of-thought (CoT) reasoning and few-shot in-context learning, which can help the LLMs fully understand the APIzation task and solve it step by step in a manner similar to a developer. Our evaluations show that Code2API achieves a remarkable accuracy in identifying method parameters (65%) and return statements (66%) equivalent to human-generated ones, surpassing the current state-of-the-art approach, APIzator, by 15.0% and 16.5% respectively. Moreover, compared with APIzator, our user study demonstrates that Code2API exhibits superior performance in generating meaningful method names, even surpassing the human-level performance, and developers are more willing to use APIs generated by our approach, highlighting the applicability of our tool in practice. Finally, we successfully extend our framework to the Python dataset, achieving a comparable performance with Java, which verifies the generalizability of our tool. △ Less

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2404.19026 [pdf, other]

MeGA: Hybrid Mesh-Gaussian Head Avatar for High-Fidelity Rendering and Head Editing

Authors: Cong Wang, Di Kang, He-Yi Sun, Shen-Han Qian, Zi-Xuan Wang, Linchao Bao, Song-Hai Zhang

Abstract: Creating high-fidelity head avatars from multi-view videos is a core issue for many AR/VR applications. However, existing methods usually struggle to obtain high-quality renderings for all different head components simultaneously since they use one single representation to model components with drastically different characteristics (e.g., skin vs. hair). In this paper, we propose a Hybrid Mesh-Gau… ▽ More Creating high-fidelity head avatars from multi-view videos is a core issue for many AR/VR applications. However, existing methods usually struggle to obtain high-quality renderings for all different head components simultaneously since they use one single representation to model components with drastically different characteristics (e.g., skin vs. hair). In this paper, we propose a Hybrid Mesh-Gaussian Head Avatar (MeGA) that models different head components with more suitable representations. Specifically, we select an enhanced FLAME mesh as our facial representation and predict a UV displacement map to provide per-vertex offsets for improved personalized geometric details. To achieve photorealistic renderings, we obtain facial colors using deferred neural rendering and disentangle neural textures into three meaningful parts. For hair modeling, we first build a static canonical hair using 3D Gaussian Splatting. A rigid transformation and an MLP-based deformation field are further applied to handle complex dynamic expressions. Combined with our occlusion-aware blending, MeGA generates higher-fidelity renderings for the whole head and naturally supports more downstream tasks. Experiments on the NeRSemble dataset demonstrate the effectiveness of our designs, outperforming previous state-of-the-art methods and supporting various editing functionalities, including hairstyle alteration and texture editing. △ Less

Submitted 29 April, 2024; originally announced April 2024.

Comments: Project page: https://conallwang.github.io/MeGA_Pages/

arXiv:2404.17070 [pdf, other]

Deep Reinforcement Learning for Bipedal Locomotion: A Brief Survey

Authors: Lingfan Bao, Joseph Humphreys, Tianhu Peng, Chengxu Zhou

Abstract: Bipedal robots are garnering increasing global attention due to their potential applications and advancements in artificial intelligence, particularly in Deep Reinforcement Learning (DRL). While DRL has driven significant progress in bipedal locomotion, develo** a comprehensive and unified framework capable of adeptly performing a wide range of tasks remains a challenge. This survey systematical… ▽ More Bipedal robots are garnering increasing global attention due to their potential applications and advancements in artificial intelligence, particularly in Deep Reinforcement Learning (DRL). While DRL has driven significant progress in bipedal locomotion, develo** a comprehensive and unified framework capable of adeptly performing a wide range of tasks remains a challenge. This survey systematically categorizes, compares, and summarizes existing DRL frameworks for bipedal locomotion, organizing them into end-to-end and hierarchical control schemes. End-to-end frameworks are assessed based on their learning approaches, whereas hierarchical frameworks are dissected into layers that utilize either learning-based methods or traditional model-based approaches. This survey provides a detailed analysis of the composition, capabilities, strengths, and limitations of each framework type. Furthermore, we identify critical research gaps and propose future directions aimed at achieving a more integrated and efficient framework for bipedal locomotion, with potential broad applications in everyday life. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: 14 pages, 4 figures

arXiv:2403.09601 [pdf, other]

Network-Controlled Repeater -- An Introduction

Authors: Fco. Italo G. Carvalho, Raul Victor de O. Paiva, Tarcisio F. Maciel, Victor F. Monteiro, Fco. Rafael M. Lima, Darlan C. Moreira, Diego A. Sousa, Behrooz Makki, Magnus Astrom, Lei Bao

Abstract: In fifth generation (5G) wireless cellular networks, millimeter wave spectrum opens room for several potential improvements in throughput, reliability, latency, among other aspects. However, it also brings challenges, such as a higher influence of blockage which may significantly limit the coverage. In this context, network-controlled repeaters (NCRs) are network nodes with low complexity that rep… ▽ More In fifth generation (5G) wireless cellular networks, millimeter wave spectrum opens room for several potential improvements in throughput, reliability, latency, among other aspects. However, it also brings challenges, such as a higher influence of blockage which may significantly limit the coverage. In this context, network-controlled repeaters (NCRs) are network nodes with low complexity that represent a technique to overcome coverage problems. In this paper, we introduce the NCR concept and study its performance gains and deployment options. Particularly, presenting the main specifications of NCR as agreed in 3rd generation partnership project (3GPP) Rel-18, we analyze different NCR deployments in an urban scenario and compare its performance with alternative deployments. As demonstrated, with a proper network planning and beamforming design, NCR is an attractive solution to cover blind spots the base stations (BSs) may have. △ Less

Submitted 14 March, 2024; originally announced March 2024.

Comments: Submmited to IEEE Communications Standards Magazine

arXiv:2401.01078 [pdf, other]

Vietnamese Poem Generation & The Prospect Of Cross-Language Poem-To-Poem Translation

Authors: Triet Minh Huynh, Quan Le Bao

Abstract: Poetry generation has been a challenging task in the field of Natural Language Processing, as it requires the model to understand the nuances of language, sentiment, and style. In this paper, we propose using Large Language Models to generate Vietnamese poems of various genres from natural language prompts, thereby facilitating an intuitive process with enhanced content control. Our most efficacio… ▽ More Poetry generation has been a challenging task in the field of Natural Language Processing, as it requires the model to understand the nuances of language, sentiment, and style. In this paper, we propose using Large Language Models to generate Vietnamese poems of various genres from natural language prompts, thereby facilitating an intuitive process with enhanced content control. Our most efficacious model, the GPT-3 Babbage variant, achieves a custom evaluation score of 0.8, specifically tailored to the "luc bat" genre of Vietnamese poetry. Furthermore, we also explore the idea of paraphrasing poems into normal text prompts and yield a relatively high score of 0.781 in the "luc bat" genre. This experiment presents the potential for cross-Language poem-to-poem translation with translated poems as the inputs while concurrently maintaining complete control over the generated content. △ Less

Submitted 4 January, 2024; v1 submitted 2 January, 2024; originally announced January 2024.

arXiv:2307.05000 [pdf, other]

Neural Point-based Volumetric Avatar: Surface-guided Neural Points for Efficient and Photorealistic Volumetric Head Avatar

Authors: Cong Wang, Di Kang, Yan-Pei Cao, Linchao Bao, Ying Shan, Song-Hai Zhang

Abstract: Rendering photorealistic and dynamically moving human heads is crucial for ensuring a pleasant and immersive experience in AR/VR and video conferencing applications. However, existing methods often struggle to model challenging facial regions (e.g., mouth interior, eyes, hair/beard), resulting in unrealistic and blurry results. In this paper, we propose {\fullname} ({\name}), a method that adopts… ▽ More Rendering photorealistic and dynamically moving human heads is crucial for ensuring a pleasant and immersive experience in AR/VR and video conferencing applications. However, existing methods often struggle to model challenging facial regions (e.g., mouth interior, eyes, hair/beard), resulting in unrealistic and blurry results. In this paper, we propose {\fullname} ({\name}), a method that adopts the neural point representation as well as the neural volume rendering process and discards the predefined connectivity and hard correspondence imposed by mesh-based approaches. Specifically, the neural points are strategically constrained around the surface of the target expression via a high-resolution UV displacement map, achieving increased modeling capacity and more accurate control. We introduce three technical innovations to improve the rendering and training efficiency: a patch-wise depth-guided (shading point) sampling strategy, a lightweight radiance decoding process, and a Grid-Error-Patch (GEP) ray sampling strategy during training. By design, our {\name} is better equipped to handle topologically changing regions and thin structures while also ensuring accurate expression control when animating avatars. Experiments conducted on three subjects from the Multiface dataset demonstrate the effectiveness of our designs, outperforming previous state-of-the-art methods, especially in handling challenging facial regions. △ Less

Submitted 13 October, 2023; v1 submitted 10 July, 2023; originally announced July 2023.

Comments: Accepted by SIGGRAPH Asia 2023

arXiv:2304.05554 [pdf, other]

Learning Transferable Pedestrian Representation from Multimodal Information Supervision

Authors: Li** Bao, Longhui Wei, Xiaoyu Qiu, Wengang Zhou, Houqiang Li, Qi Tian

Abstract: Recent researches on unsupervised person re-identification~(reID) have demonstrated that pre-training on unlabeled person images achieves superior performance on downstream reID tasks than pre-training on ImageNet. However, those pre-trained methods are specifically designed for reID and suffer flexible adaption to other pedestrian analysis tasks. In this paper, we propose VAL-PAT, a novel framewo… ▽ More Recent researches on unsupervised person re-identification~(reID) have demonstrated that pre-training on unlabeled person images achieves superior performance on downstream reID tasks than pre-training on ImageNet. However, those pre-trained methods are specifically designed for reID and suffer flexible adaption to other pedestrian analysis tasks. In this paper, we propose VAL-PAT, a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information. To train our framework, we introduce three learning objectives, \emph{i.e.,} self-supervised contrastive learning, image-text contrastive learning and multi-attribute classification. The self-supervised contrastive learning facilitates the learning of the intrinsic pedestrian properties, while the image-text contrastive learning guides the model to focus on the appearance information of pedestrians.Meanwhile, multi-attribute classification encourages the model to recognize attributes to excavate fine-grained pedestrian information. We first perform pre-training on LUPerson-TA dataset, where each image contains text and attribute annotations, and then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search. Extensive experiments demonstrate that our framework facilitates the learning of general pedestrian representations and thus leads to promising results on various pedestrian analysis tasks. △ Less

Submitted 11 April, 2023; originally announced April 2023.

arXiv:2303.15967 [pdf, other]

CM-CASL: Comparison-based Performance Modeling of Software Systems via Collaborative Active and Semisupervised Learning

Authors: Rong Cao, Liang Bao, Chase Wu, Panpan Zhangsun, Yufei Li, Zhe Zhang

Abstract: Configuration tuning for large software systems is generally challenging due to the complex configuration space and expensive performance evaluation. Most existing approaches follow a two-phase process, first learning a regression-based performance prediction model on available samples and then searching for the configurations with satisfactory performance using the learned model. Such regression-… ▽ More Configuration tuning for large software systems is generally challenging due to the complex configuration space and expensive performance evaluation. Most existing approaches follow a two-phase process, first learning a regression-based performance prediction model on available samples and then searching for the configurations with satisfactory performance using the learned model. Such regression-based models often suffer from the scarcity of samples due to the enormous time and resources required to run a large software system with a specific configuration. Moreover, previous studies have shown that even a highly accurate regression-based model may fail to discern the relative merit between two configurations, whereas performance comparison is actually one fundamental strategy for configuration tuning. To address these issues, this paper proposes CM-CASL, a Comparison-based performance Modeling approach for software systems via Collaborative Active and Semisupervised Learning. CM-CASL learns a classification model that compares the performance of two given configurations, and enhances the samples through a collaborative labeling process by both human experts and classifiers using an integration of active and semisupervised learning. Experimental results demonstrate that CM-CASL outperforms two state-of-the-art performance modeling approaches in terms of both classification accuracy and rank accuracy, and thus provides a better performance model for the subsequent work of configuration tuning. △ Less

Submitted 28 March, 2023; originally announced March 2023.

arXiv:2303.08658 [pdf, other]

Skinned Motion Retargeting with Residual Perception of Motion Semantics & Geometry

Authors: Jiaxu Zhang, Junwu Weng, Di Kang, Fang Zhao, Shaoli Huang, Xuefei Zhe, Linchao Bao, Ying Shan, Jue Wang, Zhigang Tu

Abstract: A good motion retargeting cannot be reached without reasonable consideration of source-target differences on both the skeleton and shape geometry levels. In this work, we propose a novel Residual RETargeting network (R2ET) structure, which relies on two neural modification modules, to adjust the source motions to fit the target skeletons and shapes progressively. In particular, a skeleton-aware mo… ▽ More A good motion retargeting cannot be reached without reasonable consideration of source-target differences on both the skeleton and shape geometry levels. In this work, we propose a novel Residual RETargeting network (R2ET) structure, which relies on two neural modification modules, to adjust the source motions to fit the target skeletons and shapes progressively. In particular, a skeleton-aware module is introduced to preserve the source motion semantics. A shape-aware module is designed to perceive the geometries of target characters to reduce interpenetration and contact-missing. Driven by our explored distance-based losses that explicitly model the motion semantics and geometry, these two modules can learn residual motion modifications on the source motion to generate plausible retargeted motion in a single inference without post-processing. To balance these two modifications, we further present a balancing gate to conduct linear interpolation between them. Extensive experiments on the public dataset Mixamo demonstrate that our R2ET achieves the state-of-the-art performance, and provides a good balance between the preservation of motion semantics as well as the attenuation of interpenetration and contact-missing. Code is available at https://github.com/Kebii/R2ET. △ Less

Submitted 15 March, 2023; originally announced March 2023.

Comments: CVPR 2023

arXiv:2303.07744 [pdf, other]

Sliding at first order: Higher-order momentum distributions for discontinuous image registration

Authors: Lili Bao, Jiahao Lu, Shihui Ying, Stefan Sommer

Abstract: In this paper, we propose a new approach to deformable image registration that captures sliding motions. The large deformation diffeomorphic metric map** (LDDMM) registration method faces challenges in representing sliding motion since it per construction generates smooth warps. To address this issue, we extend LDDMM by incorporating both zeroth- and first-order momenta with a non-differentiable… ▽ More In this paper, we propose a new approach to deformable image registration that captures sliding motions. The large deformation diffeomorphic metric map** (LDDMM) registration method faces challenges in representing sliding motion since it per construction generates smooth warps. To address this issue, we extend LDDMM by incorporating both zeroth- and first-order momenta with a non-differentiable kernel. This allows to represent both discontinuous deformation at switching boundaries and diffeomorphic deformation in homogeneous regions. We provide a mathematical analysis of the proposed deformation model from the viewpoint of discontinuous systems. To evaluate our approach, we conduct experiments on both artificial images and the publicly available DIR-Lab 4DCT dataset. Results show the effectiveness of our approach in capturing plausible sliding motion. △ Less

Submitted 14 March, 2023; originally announced March 2023.

MSC Class: 65D18; 65K10; 34A36; 68U10

arXiv:2302.01162 [pdf, other]

Get3DHuman: Lifting StyleGAN-Human into a 3D Generative Model using Pixel-aligned Reconstruction Priors

Authors: Zhangyang Xiong, Di Kang, Derong **, Weikai Chen, Linchao Bao, Shuguang Cui, Xiaoguang Han

Abstract: Fast generation of high-quality 3D digital humans is important to a vast number of applications ranging from entertainment to professional concerns. Recent advances in differentiable rendering have enabled the training of 3D generative models without requiring 3D ground truths. However, the quality of the generated 3D humans still has much room to improve in terms of both fidelity and diversity. I… ▽ More Fast generation of high-quality 3D digital humans is important to a vast number of applications ranging from entertainment to professional concerns. Recent advances in differentiable rendering have enabled the training of 3D generative models without requiring 3D ground truths. However, the quality of the generated 3D humans still has much room to improve in terms of both fidelity and diversity. In this paper, we present Get3DHuman, a novel 3D human framework that can significantly boost the realism and diversity of the generated outcomes by only using a limited budget of 3D ground-truth data. Our key observation is that the 3D generator can profit from human-related priors learned through 2D human generators and 3D reconstructors. Specifically, we bridge the latent space of Get3DHuman with that of StyleGAN-Human via a specially-designed prior network, where the input latent code is mapped to the shape and texture feature volumes spanned by the pixel-aligned 3D reconstructor. The outcomes of the prior network are then leveraged as the supervisory signals for the main generator network. To ensure effective training, we further propose three tailored losses applied to the generated feature volumes and the intermediate feature maps. Extensive experiments demonstrate that Get3DHuman greatly outperforms the other state-of-the-art approaches and can support a wide range of applications including shape interpolation, shape re-texturing, and single-view reconstruction through latent inversion. △ Less

Submitted 24 July, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

Comments: ICCV 2023, project page: https://x-zhangyang.github.io/2023_Get3DHuman/

arXiv:2301.11546 [pdf, other]

Adapting Step-size: A Unified Perspective to Analyze and Improve Gradient-based Methods for Adversarial Attacks

Authors: Wei Tao, Lei Bao, Sheng Long, Gaowei Wu, Qing Tao

Abstract: Learning adversarial examples can be formulated as an optimization problem of maximizing the loss function with some box-constraints. However, for solving this induced optimization problem, the state-of-the-art gradient-based methods such as FGSM, I-FGSM and MI-FGSM look different from their original methods especially in updating the direction, which makes it difficult to understand them and then… ▽ More Learning adversarial examples can be formulated as an optimization problem of maximizing the loss function with some box-constraints. However, for solving this induced optimization problem, the state-of-the-art gradient-based methods such as FGSM, I-FGSM and MI-FGSM look different from their original methods especially in updating the direction, which makes it difficult to understand them and then leaves some theoretical issues to be addressed in viewpoint of optimization. In this paper, from the perspective of adapting step-size, we provide a unified theoretical interpretation of these gradient-based adversarial learning methods. We show that each of these algorithms is in fact a specific reformulation of their original gradient methods but using the step-size rules with only current gradient information. Motivated by such analysis, we present a broad class of adaptive gradient-based algorithms based on the regular gradient methods, in which the step-size strategy utilizing information of the accumulated gradients is integrated. Such adaptive step-size strategies directly normalize the scale of the gradients rather than use some empirical operations. The important benefit is that convergence for the iterative algorithms is guaranteed and then the whole optimization process can be stabilized. The experiments demonstrate that our AdaI-FGM consistently outperforms I-FGSM and AdaMI-FGM remains competitive with MI-FGSM for black-box attacks. △ Less

Submitted 1 February, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

arXiv:2301.06690 [pdf, other]

Audio2Gestures: Generating Diverse Gestures from Audio

Authors: **g Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Linchao Bao, Zhenyu He

Abstract: People may perform diverse gestures affected by various mental and physical factors when speaking the same sentences. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume one-to-one map**, and thus tend to predict the average of all possible target motions, easily resulting in plain/boring motions during infe… ▽ More People may perform diverse gestures affected by various mental and physical factors when speaking the same sentences. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume one-to-one map**, and thus tend to predict the average of all possible target motions, easily resulting in plain/boring motions during inference. So we propose to explicitly model the one-to-many audio-to-motion map** by splitting the cross-modal latent code into shared code and motion-specific code. The shared code is expected to be responsible for the motion component that is more correlated to the audio while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts poses extra training difficulties. Several crucial training losses/strategies, including relaxed motion loss, bicycle constraint, and diversity loss, are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than previous state-of-the-art methods, quantitatively and qualitatively. Besides, our formulation is compatible with discrete cosine transformation (DCT) modeling and other popular backbones (\textit{i.e.} RNN, Transformer). As for motion losses and quantitative motion evaluation, we find structured losses/metrics (\textit{e.g.} STFT) that consider temporal and/or spatial context complement the most commonly used point-wise losses (\textit{e.g.} PCK), resulting in better motion dynamics and more nuanced motion details. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline. △ Less

Submitted 16 January, 2023; originally announced January 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2108.06720

arXiv:2301.06059 [pdf, other]

Learning Audio-Driven Viseme Dynamics for 3D Face Animation

Authors: Linchao Bao, Haoxian Zhang, Yue Qian, Tangli Xue, Changhai Chen, Xuefei Zhe, Di Kang

Abstract: We present a novel audio-driven facial animation approach that can generate realistic lip-synchronized 3D facial animations from the input audio. Our approach learns viseme dynamics from speech videos, produces animator-friendly viseme curves, and supports multilingual speech inputs. The core of our approach is a novel parametric viseme fitting algorithm that utilizes phoneme priors to extract vis… ▽ More We present a novel audio-driven facial animation approach that can generate realistic lip-synchronized 3D facial animations from the input audio. Our approach learns viseme dynamics from speech videos, produces animator-friendly viseme curves, and supports multilingual speech inputs. The core of our approach is a novel parametric viseme fitting algorithm that utilizes phoneme priors to extract viseme parameters from speech videos. With the guidance of phonemes, the extracted viseme curves can better correlate with phonemes, thus more controllable and friendly to animators. To support multilingual speech inputs and generalizability to unseen voices, we take advantage of deep audio feature models pretrained on multiple languages to learn the map** from audio to viseme curves. Our audio-to-curves map** achieves state-of-the-art performance even when the input audio suffers from distortions of volume, pitch, speed, or noise. Lastly, a viseme scanning approach for acquiring high-fidelity viseme assets is presented for efficient speech animation production. We show that the predicted viseme curves can be applied to different viseme-rigged characters to yield various personalized animations with realistic and natural facial motions. Our approach is artist-friendly and can be easily integrated into typical animation production workflows including blendshape or bone based animation. △ Less

Submitted 15 January, 2023; originally announced January 2023.

Comments: Project page: https://linchaobao.github.io/viseme2023/

arXiv:2301.04258 [pdf, other]

CARD: Semantic Segmentation with Efficient Class-Aware Regularized Decoder

Authors: Ye Huang, Di Kang, Liang Chen, Wen**g Jia, Xiangjian He, Lixin Duan, Xuefei Zhe, Linchao Bao

Abstract: Semantic segmentation has recently achieved notable advances by exploiting "class-level" contextual information during learning. However, these approaches simply concatenate class-level information to pixel features to boost the pixel representation learning, which cannot fully utilize intra-class and inter-class contextual information. Moreover, these approaches learn soft class centers based on… ▽ More Semantic segmentation has recently achieved notable advances by exploiting "class-level" contextual information during learning. However, these approaches simply concatenate class-level information to pixel features to boost the pixel representation learning, which cannot fully utilize intra-class and inter-class contextual information. Moreover, these approaches learn soft class centers based on coarse mask prediction, which is prone to error accumulation. To better exploit class level information, we propose a universal Class-Aware Regularization (CAR) approach to optimize the intra-class variance and inter-class distance during feature learning, motivated by the fact that humans can recognize an object by itself no matter which other objects it appears with. Moreover, we design a dedicated decoder for CAR (CARD), which consists of a novel spatial token mixer and an upsampling module, to maximize its gain for existing baselines while being highly efficient in terms of computational cost. Specifically, CAR consists of three novel loss functions. The first loss function encourages more compact class representations within each class, the second directly maximizes the distance between different class centers, and the third further pushes the distance between inter-class centers and pixels. Furthermore, the class center in our approach is directly generated from ground truth instead of from the error-prone coarse prediction. CAR can be directly applied to most existing segmentation models during training, and can largely improve their accuracy at no additional inference overhead. Extensive experiments and ablation studies conducted on multiple benchmark datasets demonstrate that the proposed CAR can boost the accuracy of all baseline models by up to 2.23% mIOU with superior generalization ability. CARD outperforms SOTA approaches on multiple benchmarks with a highly efficient architecture. △ Less

Submitted 10 January, 2023; originally announced January 2023.

Comments: Tech report, text extended from arXiv:2203.07160

arXiv:2211.13874 [pdf, other]

FFHQ-UV: Normalized Facial UV-Texture Dataset for 3D Face Reconstruction

Authors: Haoran Bai, Di Kang, Haoxian Zhang, **shan Pan, Linchao Bao

Abstract: We present a large-scale facial UV-texture dataset that contains over 50,000 high-quality texture UV-maps with even illuminations, neutral expressions, and cleaned facial regions, which are desired characteristics for rendering realistic 3D face models under different lighting conditions. The dataset is derived from a large-scale face image dataset namely FFHQ, with the help of our fully automatic… ▽ More We present a large-scale facial UV-texture dataset that contains over 50,000 high-quality texture UV-maps with even illuminations, neutral expressions, and cleaned facial regions, which are desired characteristics for rendering realistic 3D face models under different lighting conditions. The dataset is derived from a large-scale face image dataset namely FFHQ, with the help of our fully automatic and robust UV-texture production pipeline. Our pipeline utilizes the recent advances in StyleGAN-based facial image editing approaches to generate multi-view normalized face images from single-image inputs. An elaborated UV-texture extraction, correction, and completion procedure is then applied to produce high-quality UV-maps from the normalized face images. Compared with existing UV-texture datasets, our dataset has more diverse and higher-quality texture maps. We further train a GAN-based texture decoder as the nonlinear texture basis for parametric fitting based 3D face reconstruction. Experiments show that our method improves the reconstruction accuracy over state-of-the-art approaches, and more importantly, produces high-quality texture maps that are ready for realistic renderings. The dataset, code, and pre-trained texture decoder are publicly available at https://github.com/csbhr/FFHQ-UV. △ Less

Submitted 24 March, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

Comments: The dataset, code, and pre-trained texture decoder are publicly available at https://github.com/csbhr/FFHQ-UV

arXiv:2211.06974 [pdf]

A Comparison between Network-Controlled Repeaters and Reconfigurable Intelligent Surfaces

Authors: Hao Guo, Charitha Madapatha, Behrooz Makki, Boris Dortschy, Lei Bao, Magnus Åström, Tommy Svensson

Abstract: Network-controlled repeater (NCR) has been recently considered as a study-item in 3GPP Release 18, and the discussions are continuing in a work-item. In this paper, we introduce the concept of NCRs, as a possible low-complexity device to support for network densification and compare the performance of the NCRs with those achieved by reconfigurable intelligent surfaces (RISs). The results are prese… ▽ More Network-controlled repeater (NCR) has been recently considered as a study-item in 3GPP Release 18, and the discussions are continuing in a work-item. In this paper, we introduce the concept of NCRs, as a possible low-complexity device to support for network densification and compare the performance of the NCRs with those achieved by reconfigurable intelligent surfaces (RISs). The results are presented for the cases with different beamforming methods and hardware impairment models of the RIS. Moreover, we introduce the objectives of the 3GPP Release 18 NCR work-item and study the effect of different parameters on the performance of NCR-assisted networks. As we show, with a proper deployment, the presence of NCRs/RISs can improve the network performance considerably. △ Less

Submitted 13 November, 2022; originally announced November 2022.

Comments: 7 pages, 7 figures, submitted to potential IEEE publication

arXiv:2211.05256 [pdf, other]

Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 challenge: Report

Authors: Andrey Ignatov, Radu Timofte, Cheng-Ming Chiang, Hsien-Kai Kuo, Yu-Syuan Xu, Man-Yu Lee, Allen Lu, Chia-Ming Cheng, Chih-Cheng Chen, Jia-Ying Yong, Hong-Han Shuai, Wen-Huang Cheng, Zhuang Jia, Tianyu Xu, Yijian Zhang, Long Bao, Heng Sun, Diankai Zhang, Si Gao, Shaoli Liu, Biao Wu, Xiaofeng Zhang, Chengjian Zheng, Kaidi Lu, Ning Wang , et al. (29 additional authors not shown)

Abstract: Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this prob… ▽ More Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and propose the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models was evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 500 FPS rate and 0.2 [Watt / 30 FPS] power consumption. A detailed description of all models developed in the challenge is provided in this paper. △ Less

Submitted 7 November, 2022; originally announced November 2022.

Comments: arXiv admin note: text overlap with arXiv:2105.08826, arXiv:2105.07809, arXiv:2211.04470, arXiv:2211.03885

arXiv:2210.12723 [pdf]

A Faithful Deep Sensitivity Estimation for Accelerated Magnetic Resonance Imaging

Authors: Zi Wang, Haoming Fang, Chen Qian, Boxuan Shi, Lijun Bao, Liuhong Zhu, Jianjun Zhou, Wen** Wei, Jianzhong Lin, Di Guo, Xiaobo Qu

Abstract: Magnetic resonance imaging (MRI) is an essential diagnostic tool that suffers from prolonged scan time. To alleviate this limitation, advanced fast MRI technology attracts extensive research interests. Recent deep learning has shown its great potential in improving image quality and reconstruction speed. Faithful coil sensitivity estimation is vital for MRI reconstruction. However, most deep learn… ▽ More Magnetic resonance imaging (MRI) is an essential diagnostic tool that suffers from prolonged scan time. To alleviate this limitation, advanced fast MRI technology attracts extensive research interests. Recent deep learning has shown its great potential in improving image quality and reconstruction speed. Faithful coil sensitivity estimation is vital for MRI reconstruction. However, most deep learning methods still rely on pre-estimated sensitivity maps and ignore their inaccuracy, resulting in the significant quality degradation of reconstructed images. In this work, we propose a Joint Deep Sensitivity estimation and Image reconstruction network, called JDSI. During the image artifacts removal, it gradually provides more faithful sensitivity maps with high-frequency information, leading to improved image reconstructions. To understand the behavior of the network, the mutual promotion of sensitivity estimation and image reconstruction is revealed through the visualization of network intermediate results. Results on in vivo datasets and radiologist reader study demonstrate that, for both calibration-based and calibrationless reconstruction, the proposed JDSI achieves the state-of-the-art performance visually and quantitatively, especially when the acceleration factor is high. Additionally, JDSI owns nice robustness to patients and autocalibration signals. △ Less

Submitted 24 December, 2023; v1 submitted 23 October, 2022; originally announced October 2022.

Comments: 12 pages, 13 figures, 7 tables

arXiv:2210.00841 [pdf, other]

Smooth image-to-image translations with latent space interpolations

Authors: Yahui Liu, Enver Sangineto, Ya**g Chen, Linchao Bao, Haoxian Zhang, Nicu Sebe, Bruno Lepri, Marco De Nadai

Abstract: Multi-domain image-to-image (I2I) translations can transform a source image according to the style of a target domain. One important, desired characteristic of these transformations, is their graduality, which corresponds to a smooth change between the source and the target image when their respective latent-space representations are linearly interpolated. However, state-of-the-art methods usually… ▽ More Multi-domain image-to-image (I2I) translations can transform a source image according to the style of a target domain. One important, desired characteristic of these transformations, is their graduality, which corresponds to a smooth change between the source and the target image when their respective latent-space representations are linearly interpolated. However, state-of-the-art methods usually perform poorly when evaluated using inter-domain interpolations, often producing abrupt changes in the appearance or non-realistic intermediate images. In this paper, we argue that one of the main reasons behind this problem is the lack of sufficient inter-domain training data and we propose two different regularization methods to alleviate this issue: a new shrinkage loss, which compacts the latent space, and a Mixup data-augmentation strategy, which flattens the style representations between domains. We also propose a new metric to quantitatively evaluate the degree of the interpolation smoothness, an aspect which is not sufficiently covered by the existing I2I translation metrics. Using both our proposed metric and standard evaluation protocols, we show that our regularization techniques can improve the state-of-the-art multi-domain I2I translations by a large margin. Our code will be made publicly available upon the acceptance of this article. △ Less

Submitted 14 March, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

arXiv:2209.13204 [pdf, other]

NEURAL MARIONETTE: A Transformer-based Multi-action Human Motion Synthesis System

Authors: Weiqiang Wang, Xuefei Zhe, Qiuhong Ke, Di Kang, Tingguang Li, Ruizhi Chen, Linchao Bao

Abstract: We present a neural network-based system for long-term, multi-action human motion synthesis. The system, dubbed as NEURAL MARIONETTE, can produce high-quality and meaningful motions with smooth transitions from simple user input, including a sequence of action tags with expected action duration, and optionally a hand-drawn moving trajectory if the user specifies. The core of our system is a novel… ▽ More We present a neural network-based system for long-term, multi-action human motion synthesis. The system, dubbed as NEURAL MARIONETTE, can produce high-quality and meaningful motions with smooth transitions from simple user input, including a sequence of action tags with expected action duration, and optionally a hand-drawn moving trajectory if the user specifies. The core of our system is a novel Transformer-based motion generation model, namely MARIONET, which can generate diverse motions given action tags. Different from existing motion generation models, MARIONET utilizes contextual information from the past motion clip and future action tag, dedicated to generating actions that can smoothly blend historical and future actions. Specifically, MARIONET first encodes target action tag and contextual information into an action-level latent code. The code is unfolded into frame-level control signals via a time unrolling module, which could be then combined with other frame-level control signals like the target trajectory. Motion frames are then generated in an auto-regressive way. By sequentially applying MARIONET, the system NEURAL MARIONETTE can robustly generate long-term, multi-action motions with the help of two simple schemes, namely "Shadow Start" and "Action Revision". Along with the novel system, we also present a new dataset dedicated to the multi-action motion synthesis task, which contains both action tags and their contextual information. Extensive experiments are conducted to study the action accuracy, naturalism, and transition smoothness of the motions generated by our system. △ Less

Submitted 27 November, 2023; v1 submitted 27 September, 2022; originally announced September 2022.

arXiv:2208.14600 [pdf, other]

ELSR: Extreme Low-Power Super Resolution Network For Mobile Devices

Authors: Tianyu Xu, Zhuang Jia, Yijian Zhang, Long Bao, Heng Sun

Abstract: With the popularity of mobile devices, e.g., smartphone and wearable devices, lighter and faster model is crucial for the application of video super resolution. However, most previous lightweight models tend to concentrate on reducing lantency of model inference on desktop GPU, which may be not energy efficient in current mobile devices. In this paper, we proposed Extreme Low-Power Super Resolutio… ▽ More With the popularity of mobile devices, e.g., smartphone and wearable devices, lighter and faster model is crucial for the application of video super resolution. However, most previous lightweight models tend to concentrate on reducing lantency of model inference on desktop GPU, which may be not energy efficient in current mobile devices. In this paper, we proposed Extreme Low-Power Super Resolution (ELSR) network which only consumes a small amount of energy in mobile devices. Pretraining and finetuning methods are applied to boost the performance of the extremely tiny model. Extensive experiments show that our method achieves a excellent balance between restoration quality and power consumption. Finally, we achieve a competitive score of 90.9 with PSNR 27.34 dB and power 0.09 W/30FPS on the target MediaTek Dimensity 9000 plantform, ranking 1st place in the Mobile AI & AIM 2022 Real-Time Video Super-Resolution Challenge. △ Less

Submitted 30 August, 2022; originally announced August 2022.

arXiv:2208.11948 [pdf, other]

Learning to Construct 3D Building Wireframes from 3D Line Clouds

Authors: Yicheng Luo, **g Ren, Xuefei Zhe, Di Kang, Ya**g Xu, Peter Wonka, Linchao Bao

Abstract: Line clouds, though under-investigated in the previous work, potentially encode more compact structural information of buildings than point clouds extracted from multi-view images. In this work, we propose the first network to process line clouds for building wireframe abstraction. The network takes a line cloud as input , i.e., a nonstructural and unordered set of 3D line segments extracted from… ▽ More Line clouds, though under-investigated in the previous work, potentially encode more compact structural information of buildings than point clouds extracted from multi-view images. In this work, we propose the first network to process line clouds for building wireframe abstraction. The network takes a line cloud as input , i.e., a nonstructural and unordered set of 3D line segments extracted from multi-view images, and outputs a 3D wireframe of the underlying building, which consists of a sparse set of 3D junctions connected by line segments. We observe that a line patch, i.e., a group of neighboring line segments, encodes sufficient contour information to predict the existence and even the 3D position of a potential junction, as well as the likelihood of connectivity between two query junctions. We therefore introduce a two-layer Line-Patch Transformer to extract junctions and connectivities from sampled line patches to form a 3D building wireframe model. We also introduce a synthetic dataset of multi-view images with ground-truth 3D wireframe. We extensively justify that our reconstructed 3D wireframe models significantly improve upon multiple baseline building reconstruction methods. The code and data can be found at https://github.com/Luo1Cheng/LC2WF. △ Less

Submitted 4 November, 2022; v1 submitted 25 August, 2022; originally announced August 2022.

Comments: 10 pages, 6 figures

arXiv:2208.10769 [pdf, other]

PIFu for the Real World: A Self-supervised Framework to Reconstruct Dressed Human from Single-view Images

Authors: Zhangyang Xiong, Dong Du, Yushuang Wu, **gqi Dong, Di Kang, Linchao Bao, Xiaoguang Han

Abstract: It is very challenging to accurately reconstruct sophisticated human geometry caused by various poses and garments from a single image. Recently, works based on pixel-aligned implicit function (PIFu) have made a big step and achieved state-of-the-art fidelity on image-based 3D human digitization. However, the training of PIFu relies heavily on expensive and limited 3D ground truth data (i.e. synth… ▽ More It is very challenging to accurately reconstruct sophisticated human geometry caused by various poses and garments from a single image. Recently, works based on pixel-aligned implicit function (PIFu) have made a big step and achieved state-of-the-art fidelity on image-based 3D human digitization. However, the training of PIFu relies heavily on expensive and limited 3D ground truth data (i.e. synthetic data), thus hindering its generalization to more diverse real world images. In this work, we propose an end-to-end self-supervised network named SelfPIFu to utilize abundant and diverse in-the-wild images, resulting in largely improved reconstructions when tested on unconstrained in-the-wild images. At the core of SelfPIFu is the depth-guided volume-/surface-aware signed distance fields (SDF) learning, which enables self-supervised learning of a PIFu without access to GT mesh. The whole framework consists of a normal estimator, a depth estimator, and a SDF-based PIFu and better utilizes extra depth GT during training. Extensive experiments demonstrate the effectiveness of our self-supervised framework and the superiority of using depth as input. On synthetic data, our Intersection-Over-Union (IoU) achieves to 93.5%, 18% higher compared with PIFuHD. For in-the-wild images, we conduct user studies on the reconstructed results, the selection rate of our results is over 68% compared with other state-of-the-art methods. △ Less

Submitted 8 March, 2024; v1 submitted 23 August, 2022; originally announced August 2022.

Comments: CVM 2024

arXiv:2206.06715 [pdf, other]

Semi-signed prioritized neural fitting for surface reconstruction from unoriented point clouds

Authors: Runsong Zhu, Di Kang, Ka-Hei Hui, Yue Qian, Xuefei Zhe, Zhen Dong, Linchao Bao, Pheng-Ann Heng, Chi-Wing Fu

Abstract: Reconstructing 3D geometry from \emph{unoriented} point clouds can benefit many downstream tasks. Recent shape modeling methods mostly adopt implicit neural representation to fit a signed distance field (SDF) and optimize the network by \emph{unsigned} supervision. However, these methods occasionally have difficulty in finding the coarse shape for complicated objects, especially suffering from the… ▽ More Reconstructing 3D geometry from \emph{unoriented} point clouds can benefit many downstream tasks. Recent shape modeling methods mostly adopt implicit neural representation to fit a signed distance field (SDF) and optimize the network by \emph{unsigned} supervision. However, these methods occasionally have difficulty in finding the coarse shape for complicated objects, especially suffering from the ``ghost'' surfaces (\ie, fake surfaces that should not exist). To guide the network quickly fit the coarse shape, we propose to utilize the signed supervision in regions that are obviously outside the object and can be easily determined, resulting in our semi-signed supervision. To better recover high-fidelity details, a novel importance sampling based on tracked region losses and a progressive positional encoding (PE) prioritize the optimization towards underfitting and complicated regions. Specifically, we voxelize and partition the object space into \emph{sign-known} and \emph{sign-uncertain} regions, in which different supervisions are applied. Besides, we adaptively adjust the sampling rate of each voxel according to the tracked reconstruction loss, so that the network can focus more on the complicated under-fitting regions. To this end, we propose our semi-signed prioritized (SSP) neural fitting, and conduct extensive experiments to demonstrate that SSP achieves state-of-the-art performance on multiple datasets including the ABC subset and various challenging data. The code will be released upon the publication. △ Less

Submitted 14 December, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

arXiv:2205.12633 [pdf, other]

NTIRE 2022 Challenge on High Dynamic Range Imaging: Methods and Results

Authors: Eduardo Pérez-Pellitero, Sibi Catley-Chandar, Richard Shaw, Aleš Leonardis, Radu Timofte, Zexin Zhang, Cen Liu, Yunbo Peng, Yue Lin, Gaocheng Yu, ** Zhang, Zhe Ma, Hongbin Wang, Xiangyu Chen, Xintao Wang, Haiwei Wu, Lin Liu, Chao Dong, Jiantao Zhou, Qingsen Yan, Song Zhang, Weiye Chen, Yuhang Liu, Zhen Zhang, Yanning Zhang , et al. (68 additional authors not shown)

Abstract: This paper reviews the challenge on constrained high dynamic range (HDR) imaging that was part of the New Trends in Image Restoration and Enhancement (NTIRE) workshop, held in conjunction with CVPR 2022. This manuscript focuses on the competition set-up, datasets, the proposed methods and their results. The challenge aims at estimating an HDR image from multiple respective low dynamic range (LDR)… ▽ More This paper reviews the challenge on constrained high dynamic range (HDR) imaging that was part of the New Trends in Image Restoration and Enhancement (NTIRE) workshop, held in conjunction with CVPR 2022. This manuscript focuses on the competition set-up, datasets, the proposed methods and their results. The challenge aims at estimating an HDR image from multiple respective low dynamic range (LDR) observations, which might suffer from under- or over-exposed regions and different sources of noise. The challenge is composed of two tracks with an emphasis on fidelity and complexity constraints: In Track 1, participants are asked to optimize objective fidelity scores while imposing a low-complexity constraint (i.e. solutions can not exceed a given number of operations). In Track 2, participants are asked to minimize the complexity of their solutions while imposing a constraint on fidelity scores (i.e. solutions are required to obtain a higher fidelity score than the prescribed baseline). Both tracks use the same data and metrics: Fidelity is measured by means of PSNR with respect to a ground-truth HDR image (computed both directly and with a canonical tonemap** operation), while complexity metrics include the number of Multiply-Accumulate (MAC) operations and runtime (in seconds). △ Less

Submitted 25 May, 2022; originally announced May 2022.

Comments: CVPR Workshops 2022. 15 pages, 21 figures, 2 tables

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022

arXiv:2204.01147 [pdf, other]

Continuous Jum** for Legged Robots on Step** Stones via Trajectory Optimization and Model Predictive Control

Authors: Chuong Nguyen, Lingfan Bao, Quan Nguyen

Abstract: Performing highly agile dynamic motions, such as jum** or running on uneven step** stones has remained a challenging problem in legged robot locomotion. This paper presents a framework that combines trajectory optimization and model predictive control to perform robust and consecutive jum** on step** stones. In our approach, we first utilize trajectory optimization based on full-nonlinear… ▽ More Performing highly agile dynamic motions, such as jum** or running on uneven step** stones has remained a challenging problem in legged robot locomotion. This paper presents a framework that combines trajectory optimization and model predictive control to perform robust and consecutive jum** on step** stones. In our approach, we first utilize trajectory optimization based on full-nonlinear dynamics of the robot to generate periodic jum** trajectories for various jum** distances. A jum** controller based on a model predictive control is then designed for realizing smooth jum** transitions, enabling the robot to achieve continuous jumps on step** stones. Thanks to the incorporation of MPC as a real-time feedback controller, the proposed framework is also validated to be robust to uneven platforms with unknown height perturbations and model uncertainty on the robot dynamics. △ Less

Submitted 16 September, 2022; v1 submitted 3 April, 2022; originally announced April 2022.

Comments: Accepted to the 61st IEEE Conference on Decision and Control (CDC 2022)

arXiv:2203.09729 [pdf, other]

REALY: Rethinking the Evaluation of 3D Face Reconstruction

Authors: Zenghao Chai, Haoxian Zhang, **g Ren, Di Kang, Zhengzhuo Xu, Xuefei Zhe, Chun Yuan, Linchao Bao

Abstract: The evaluation of 3D face reconstruction results typically relies on a rigid shape alignment between the estimated 3D model and the ground-truth scan. We observe that aligning two shapes with different reference points can largely affect the evaluation results. This poses difficulties for precisely diagnosing and improving a 3D face reconstruction method. In this paper, we propose a novel evaluati… ▽ More The evaluation of 3D face reconstruction results typically relies on a rigid shape alignment between the estimated 3D model and the ground-truth scan. We observe that aligning two shapes with different reference points can largely affect the evaluation results. This poses difficulties for precisely diagnosing and improving a 3D face reconstruction method. In this paper, we propose a novel evaluation approach with a new benchmark REALY, consists of 100 globally aligned face scans with accurate facial keypoints, high-quality region masks, and topology-consistent meshes. Our approach performs region-wise shape alignment and leads to more accurate, bidirectional correspondences during computing the shape errors. The fine-grained, region-wise evaluation results provide us detailed understandings about the performance of state-of-the-art 3D face reconstruction methods. For example, our experiments on single-image based reconstruction methods reveal that DECA performs the best on nose regions, while GANFit performs better on cheek regions. Besides, a new and high-quality 3DMM basis, HIFI3D++, is further derived using the same procedure as we construct REALY to align and retopologize several 3D face datasets. We will release REALY, HIFI3D++, and our new evaluation pipeline at https://realy3dface.com. △ Less

Submitted 19 July, 2022; v1 submitted 18 March, 2022; originally announced March 2022.

Comments: Accepted to ECCV 2022, camera-ready version; Project page: https://realy3dface.com; Code: https://github.com/czh-98/REALY

arXiv:2203.07160 [pdf, other]

CAR: Class-aware Regularizations for Semantic Segmentation

Authors: Ye Huang, Di Kang, Liang Chen, Xuefei Zhe, Wen**g Jia, Xiangjian He, Linchao Bao

Abstract: Recent segmentation methods, such as OCR and CPNet, utilizing "class level" information in addition to pixel features, have achieved notable success for boosting the accuracy of existing network modules. However, the extracted class-level information was simply concatenated to pixel features, without explicitly being exploited for better pixel representation learning. Moreover, these approaches le… ▽ More Recent segmentation methods, such as OCR and CPNet, utilizing "class level" information in addition to pixel features, have achieved notable success for boosting the accuracy of existing network modules. However, the extracted class-level information was simply concatenated to pixel features, without explicitly being exploited for better pixel representation learning. Moreover, these approaches learn soft class centers based on coarse mask prediction, which is prone to error accumulation. In this paper, aiming to use class level information more effectively, we propose a universal Class-Aware Regularization (CAR) approach to optimize the intra-class variance and inter-class distance during feature learning, motivated by the fact that humans can recognize an object by itself no matter which other objects it appears with. Three novel loss functions are proposed. The first loss function encourages more compact class representations within each class, the second directly maximizes the distance between different class centers, and the third further pushes the distance between inter-class centers and pixels. Furthermore, the class center in our approach is directly generated from ground truth instead of from the error-prone coarse prediction. Our method can be easily applied to most existing segmentation models during training, including OCR and CPNet, and can largely improve their accuracy at no additional inference overhead. Extensive experiments and ablation studies conducted on multiple benchmark datasets demonstrate that the proposed CAR can boost the accuracy of all baseline models by up to 2.23% mIOU with superior generalization ability. The complete code is available at https://github.com/edwardyehuang/CAR. △ Less

Submitted 14 July, 2022; v1 submitted 14 March, 2022; originally announced March 2022.

Comments: ECCV 2022 camera ready. Codes and models are available at https://github.com/edwardyehuang/CAR

arXiv:2203.00803 [pdf, other]

Code Smells in Machine Learning Systems

Authors: Jiri Gesi, Siqi Liu, Jiawei Li, Iftekhar Ahmed, Nachiappan Nagappan, David Lo, Eduardo Santana de Almeida, Pavneet Singh Kochhar, Lingfeng Bao

Abstract: As Deep learning (DL) systems continuously evolve and grow, assuring their quality becomes an important yet challenging task. Compared to non-DL systems, DL systems have more complex team compositions and heavier data dependency. These inherent characteristics would potentially cause DL systems to be more vulnerable to bugs and, in the long run, to maintenance issues. Code smells are empirically t… ▽ More As Deep learning (DL) systems continuously evolve and grow, assuring their quality becomes an important yet challenging task. Compared to non-DL systems, DL systems have more complex team compositions and heavier data dependency. These inherent characteristics would potentially cause DL systems to be more vulnerable to bugs and, in the long run, to maintenance issues. Code smells are empirically tested as efficient indicators of non-DL systems. Therefore, we took a step forward into identifying code smells, and understanding their impact on maintenance in this comprehensive study. This is the first study on investigating code smells in the context of DL software systems, which helps researchers and practitioners to get a first look at what kind of maintenance modification made and what code smells developers have been dealing with. Our paper has three major contributions. First, we comprehensively investigated the maintenance modifications that have been made by DL developers via studying the evolution of DL systems, and we identified nine frequently occurred maintenance-related modification categories in DL systems. Second, we summarized five code smells in DL systems. Third, we validated the prevalence, and the impact of our newly identified code smells through a mixture of qualitative and quantitative analysis. We found that our newly identified code smells are prevalent and impactful on the maintenance of DL systems from the developer's perspective. △ Less

Submitted 1 March, 2022; originally announced March 2022.

arXiv:2202.04879 [pdf, other]

PVSeRF: Joint Pixel-, Voxel- and Surface-Aligned Radiance Field for Single-Image Novel View Synthesis

Authors: Xianggang Yu, Jiapeng Tang, Yipeng Qin, Chenghong Li, Linchao Bao, Xiaoguang Han, Shuguang Cui

Abstract: We present PVSeRF, a learning framework that reconstructs neural radiance fields from single-view RGB images, for novel view synthesis. Previous solutions, such as pixelNeRF, rely only on pixel-aligned features and suffer from feature ambiguity issues. As a result, they struggle with the disentanglement of geometry and appearance, leading to implausible geometries and blurry results. To address th… ▽ More We present PVSeRF, a learning framework that reconstructs neural radiance fields from single-view RGB images, for novel view synthesis. Previous solutions, such as pixelNeRF, rely only on pixel-aligned features and suffer from feature ambiguity issues. As a result, they struggle with the disentanglement of geometry and appearance, leading to implausible geometries and blurry results. To address this challenge, we propose to incorporate explicit geometry reasoning and combine it with pixel-aligned features for radiance field prediction. Specifically, in addition to pixel-aligned features, we further constrain the radiance field learning to be conditioned on i) voxel-aligned features learned from a coarse volumetric grid and ii) fine surface-aligned features extracted from a regressed point cloud. We show that the introduction of such geometry-aware features helps to achieve a better disentanglement between appearance and geometry, i.e. recovering more accurate geometries and synthesizing higher quality images of novel views. Extensive experiments against state-of-the-art methods on ShapeNet benchmarks demonstrate the superiority of our approach for single-image novel view synthesis. △ Less

Submitted 10 February, 2022; originally announced February 2022.

arXiv:2201.09548 [pdf, other]

doi 10.1109/TPAMI.2023.3247907

Consistent 3D Hand Reconstruction in Video via self-supervised Learning

Authors: Zhigang Tu, Zhisheng Huang, Yu** Chen, Di Kang, Linchao Bao, Bisheng Yang, Junsong Yuan

Abstract: We present a method for reconstructing accurate and consistent 3D hands from a monocular video. We observe that detected 2D hand keypoints and the image texture provide important cues about the geometry and texture of the 3D hand, which can reduce or even eliminate the requirement on 3D hand annotation. Thus we propose ${\rm {S}^{2}HAND}$, a self-supervised 3D hand reconstruction model, that can j… ▽ More We present a method for reconstructing accurate and consistent 3D hands from a monocular video. We observe that detected 2D hand keypoints and the image texture provide important cues about the geometry and texture of the 3D hand, which can reduce or even eliminate the requirement on 3D hand annotation. Thus we propose ${\rm {S}^{2}HAND}$, a self-supervised 3D hand reconstruction model, that can jointly estimate pose, shape, texture, and the camera viewpoint from a single RGB input through the supervision of easily accessible 2D detected keypoints. We leverage the continuous hand motion information contained in the unlabeled video data and propose ${\rm {S}^{2}HAND(V)}$, which uses a set of weights shared ${\rm {S}^{2}HAND}$ to process each frame and exploits additional motion, texture, and shape consistency constrains to promote more accurate hand poses and more consistent shapes and textures. Experiments on benchmark datasets demonstrate that our self-supervised approach produces comparable hand reconstruction performance compared with the recent full-supervised methods in single-frame as input setup, and notably improves the reconstruction accuracy and consistency when using video training data. △ Less

Submitted 20 March, 2023; v1 submitted 24 January, 2022; originally announced January 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2103.11703

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence. 2023

arXiv:2111.15234 [pdf, other]

NeRFReN: Neural Radiance Fields with Reflections

Authors: Yuan-Chen Guo, Di Kang, Linchao Bao, Yu He, Song-Hai Zhang

Abstract: Neural Radiance Fields (NeRF) has achieved unprecedented view synthesis quality using coordinate-based neural scene representations. However, NeRF's view dependency can only handle simple reflections like highlights but cannot deal with complex reflections such as those from glass and mirrors. In these scenarios, NeRF models the virtual image as real geometries which leads to inaccurate depth esti… ▽ More Neural Radiance Fields (NeRF) has achieved unprecedented view synthesis quality using coordinate-based neural scene representations. However, NeRF's view dependency can only handle simple reflections like highlights but cannot deal with complex reflections such as those from glass and mirrors. In these scenarios, NeRF models the virtual image as real geometries which leads to inaccurate depth estimation, and produces blurry renderings when the multi-view consistency is violated as the reflected objects may only be seen under some of the viewpoints. To overcome these issues, we introduce NeRFReN, which is built upon NeRF to model scenes with reflections. Specifically, we propose to split a scene into transmitted and reflected components, and model the two components with separate neural radiance fields. Considering that this decomposition is highly under-constrained, we exploit geometric priors and apply carefully-designed training strategies to achieve reasonable decomposition results. Experiments on various self-captured scenes show that our method achieves high-quality novel view synthesis and physically sound depth estimation results while enabling scene editing applications. △ Less

Submitted 6 April, 2022; v1 submitted 30 November, 2021; originally announced November 2021.

Comments: Accepted to CVPR 2022. Project page: https://bennyguo.github.io/nerfren/

arXiv:2109.12492 [pdf, other]

ISF-GAN: An Implicit Style Function for High-Resolution Image-to-Image Translation

Authors: Yahui Liu, Ya**g Chen, Linchao Bao, Nicu Sebe, Bruno Lepri, Marco De Nadai

Abstract: Recently, there has been an increasing interest in image editing methods that employ pre-trained unconditional image generators (e.g., StyleGAN). However, applying these methods to translate images to multiple visual domains remains challenging. Existing works do not often preserve the domain-invariant part of the image (e.g., the identity in human face translations), they do not usually handle mu… ▽ More Recently, there has been an increasing interest in image editing methods that employ pre-trained unconditional image generators (e.g., StyleGAN). However, applying these methods to translate images to multiple visual domains remains challenging. Existing works do not often preserve the domain-invariant part of the image (e.g., the identity in human face translations), they do not usually handle multiple domains, or do not allow for multi-modal translations. This work proposes an implicit style function (ISF) to straightforwardly achieve multi-modal and multi-domain image-to-image translation from pre-trained unconditional generators. The ISF manipulates the semantics of an input latent code to make the image generated from it lying in the desired visual domain. Our results in human face and animal manipulations show significantly improved results over the baselines. Our model enables cost-effective multi-modal unsupervised image-to-image translations at high resolution using pre-trained unconditional GANs. The code and data are available at: \url{https://github.com/yhlleo/stylegan-mmuit}. △ Less

Submitted 23 February, 2022; v1 submitted 26 September, 2021; originally announced September 2021.

Comments: 14 pages, 15 figures

Journal ref: IEEE Transactions on Multimedia, 2022

arXiv:2108.06720 [pdf, other]

Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders

Authors: **g Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Zhenyu He, Linchao Bao

Abstract: Generating conversational gestures from speech audio is challenging due to the inherent one-to-many map** between audio and body motions. Conventional CNNs/RNNs assume one-to-one map**, and thus tend to predict the average of all possible target motions, resulting in plain/boring motions during inference. In order to overcome this problem, we propose a novel conditional variational autoencoder… ▽ More Generating conversational gestures from speech audio is challenging due to the inherent one-to-many map** between audio and body motions. Conventional CNNs/RNNs assume one-to-one map**, and thus tend to predict the average of all possible target motions, resulting in plain/boring motions during inference. In order to overcome this problem, we propose a novel conditional variational autoencoder (VAE) that explicitly models one-to-many audio-to-motion map** by splitting the cross-modal latent code into shared code and motion-specific code. The shared code mainly models the strong correlation between audio and motion (such as the synchronized audio and motion beats), while the motion-specific code captures diverse motion information independent of the audio. However, splitting the latent code into two parts poses training difficulties for the VAE model. A map** network facilitating random sampling along with other techniques including relaxed motion loss, bicycle constraint, and diversity loss are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than state-of-the-art methods, quantitatively and qualitatively. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline. Code and more results are at https://**gli513.github.io/audio2gestures. △ Less

Submitted 15 August, 2021; originally announced August 2021.

arXiv:2108.05650 [pdf, other]

UniFaceGAN: A Unified Framework for Temporally Consistent Facial Video Editing

Authors: Meng Cao, Haozhi Huang, Hao Wang, Xuan Wang, Li Shen, Sheng Wang, Linchao Bao, Zhifeng Li, Jiebo Luo

Abstract: Recent research has witnessed advances in facial image editing tasks including face swap** and face reenactment. However, these methods are confined to dealing with one specific task at a time. In addition, for video facial editing, previous methods either simply apply transformations frame by frame or utilize multiple frames in a concatenated or iterative fashion, which leads to noticeable visu… ▽ More Recent research has witnessed advances in facial image editing tasks including face swap** and face reenactment. However, these methods are confined to dealing with one specific task at a time. In addition, for video facial editing, previous methods either simply apply transformations frame by frame or utilize multiple frames in a concatenated or iterative fashion, which leads to noticeable visual flickers. In this paper, we propose a unified temporally consistent facial video editing framework termed UniFaceGAN. Based on a 3D reconstruction model and a simple yet efficient dynamic training sample selection mechanism, our framework is designed to handle face swap** and face reenactment simultaneously. To enforce the temporal consistency, a novel 3D temporal loss constraint is introduced based on the barycentric coordinate interpolation. Besides, we propose a region-aware conditional normalization layer to replace the traditional AdaIN or SPADE to synthesize more context-harmonious results. Compared with the state-of-the-art facial image editing methods, our framework generates video portraits that are more photo-realistic and temporally smooth. △ Less

Submitted 12 August, 2021; originally announced August 2021.

Comments: Accepted by IEEE Transactions on Image Processing (TIP). arXiv admin note: text overlap with arXiv:2007.01466

arXiv:2106.13629 [pdf, other]

Animatable Neural Radiance Fields from Monocular RGB Videos

Authors: Jianchuan Chen, Ying Zhang, Di Kang, Xuefei Zhe, Linchao Bao, Xu Jia, Huchuan Lu

Abstract: We present animatable neural radiance fields (animatable NeRF) for detailed human avatar creation from monocular videos. Our approach extends neural radiance fields (NeRF) to the dynamic scenes with human movements via introducing explicit pose-guided deformation while learning the scene representation network. In particular, we estimate the human pose for each frame and learn a constant canonical… ▽ More We present animatable neural radiance fields (animatable NeRF) for detailed human avatar creation from monocular videos. Our approach extends neural radiance fields (NeRF) to the dynamic scenes with human movements via introducing explicit pose-guided deformation while learning the scene representation network. In particular, we estimate the human pose for each frame and learn a constant canonical space for the detailed human template, which enables natural shape deformation from the observation space to the canonical space under the explicit control of the pose parameters. To compensate for inaccurate pose estimation, we introduce the pose refinement strategy that updates the initial pose during the learning process, which not only helps to learn more accurate human reconstruction but also accelerates the convergence. In experiments we show that the proposed approach achieves 1) implicit human geometry and appearance reconstruction with high-quality details, 2) photo-realistic rendering of the human from novel views, and 3) animation of the human with novel poses. △ Less

Submitted 7 September, 2021; v1 submitted 25 June, 2021; originally announced June 2021.

Comments: 12 pages, 12 figures

arXiv:2106.09016 [pdf, other]

Smoothing the Disentangled Latent Style Space for Unsupervised Image-to-Image Translation

Authors: Yahui Liu, Enver Sangineto, Ya**g Chen, Linchao Bao, Haoxian Zhang, Nicu Sebe, Bruno Lepri, Wei Wang, Marco De Nadai

Abstract: Image-to-Image (I2I) multi-domain translation models are usually evaluated also using the quality of their semantic interpolation results. However, state-of-the-art models frequently show abrupt changes in the image appearance during interpolation, and usually perform poorly in interpolations across domains. In this paper, we propose a new training protocol based on three specific losses which hel… ▽ More Image-to-Image (I2I) multi-domain translation models are usually evaluated also using the quality of their semantic interpolation results. However, state-of-the-art models frequently show abrupt changes in the image appearance during interpolation, and usually perform poorly in interpolations across domains. In this paper, we propose a new training protocol based on three specific losses which help a translation network to learn a smooth and disentangled latent style space in which: 1) Both intra- and inter-domain interpolations correspond to gradual changes in the generated images and 2) The content of the source image is better preserved during the translation. Moreover, we propose a novel evaluation metric to properly measure the smoothness of latent style space of I2I translation models. The proposed method can be plugged into existing translation approaches, and our extensive experiments on different datasets show that it can significantly boost the quality of the generated images and the graduality of the interpolations. △ Less

Submitted 16 June, 2021; originally announced June 2021.

Comments: Accepted to CVPR 2021

arXiv:2103.11703 [pdf, other]

Model-based 3D Hand Reconstruction via Self-Supervised Learning

Authors: Yu** Chen, Zhigang Tu, Di Kang, Linchao Bao, Ying Zhang, Xuefei Zhe, Ruizhi Chen, Junsong Yuan

Abstract: Reconstructing a 3D hand from a single-view RGB image is challenging due to various hand configurations and depth ambiguity. To reliably reconstruct a 3D hand from a monocular image, most state-of-the-art methods heavily rely on 3D annotations at the training stage, but obtaining 3D annotations is expensive. To alleviate reliance on labeled training data, we propose S2HAND, a self-supervised 3D ha… ▽ More Reconstructing a 3D hand from a single-view RGB image is challenging due to various hand configurations and depth ambiguity. To reliably reconstruct a 3D hand from a monocular image, most state-of-the-art methods heavily rely on 3D annotations at the training stage, but obtaining 3D annotations is expensive. To alleviate reliance on labeled training data, we propose S2HAND, a self-supervised 3D hand reconstruction network that can jointly estimate pose, shape, texture, and the camera viewpoint. Specifically, we obtain geometric cues from the input image through easily accessible 2D detected keypoints. To learn an accurate hand reconstruction model from these noisy geometric cues, we utilize the consistency between 2D and 3D representations and propose a set of novel losses to rationalize outputs of the neural network. For the first time, we demonstrate the feasibility of training an accurate 3D hand reconstruction network without relying on manual annotations. Our experiments show that the proposed method achieves comparable performance with recent fully-supervised methods while using fewer supervision data. △ Less

Submitted 22 March, 2021; originally announced March 2021.

Comments: Accepted by CVPR21

arXiv:2103.11610 [pdf, other]

doi 10.1145/3392093

psc2code: Denoising Code Extraction from Programming Screencasts

Authors: Lingfeng Bao, Zhenchang Xing, Xin Xia, David Lo, Minghui Wu, Xiaohu Yang

Abstract: In this paper, we propose an approach named psc2code to denoise the process of extracting source code from programming screencasts. First, psc2code leverages the Convolutional Neural Network based image classification to remove non-code and noisy-code frames. Then, psc2code performs edge detection and clustering-based image segmentation to detect sub-windows in a code frame, and based on the detec… ▽ More In this paper, we propose an approach named psc2code to denoise the process of extracting source code from programming screencasts. First, psc2code leverages the Convolutional Neural Network based image classification to remove non-code and noisy-code frames. Then, psc2code performs edge detection and clustering-based image segmentation to detect sub-windows in a code frame, and based on the detected sub-windows, it identifies and crops the screen region that is most likely to be a code editor. Finally, psc2code calls the API of a professional OCR tool to extract source code from the cropped code regions and leverages the OCRed cross-frame information in the programming screencast and the statistical language model of a large corpus of source code to correct errors in the OCRed source code. We conduct an experiment on 1,142 programming screencasts from YouTube. We find that our CNN-based image classification technique can effectively remove the non-code and noisy-code frames, which achieves an F1-score of 0.95 on the valid code frames. Based on the source code denoised by psc2code, we implement two applications: 1) a programming screencast search engine; 2) an interaction-enhanced programming screencast watching tool. Based on the source code extracted from the 1,142 collected programming screencasts, our experiments show that our programming screencast search engine achieves the precision@5, 10, and 20 of 0.93, 0.81, and 0.63, respectively. △ Less

Submitted 22 March, 2021; originally announced March 2021.

Comments: pre-print TOSEM paper for ICSE 2021

Journal ref: ACM Trans. Softw. Eng. Methodol. 29, 3, Article 21 (July 2020), 38 pages

arXiv:2012.07565 [pdf, other]

Automating Document Classification with Distant Supervision to Increase the Efficiency of Systematic Reviews

Authors: Xiaoxiao Li, Rabah Al-Zaidy, Amy Zhang, Stefan Baral, Le Bao, C. Lee Giles

Abstract: Objective: Systematic reviews of scholarly documents often provide complete and exhaustive summaries of literature relevant to a research question. However, well-done systematic reviews are expensive, time-demanding, and labor-intensive. Here, we propose an automatic document classification approach to significantly reduce the effort in reviewing documents. Methods: We first describe a manual docu… ▽ More Objective: Systematic reviews of scholarly documents often provide complete and exhaustive summaries of literature relevant to a research question. However, well-done systematic reviews are expensive, time-demanding, and labor-intensive. Here, we propose an automatic document classification approach to significantly reduce the effort in reviewing documents. Methods: We first describe a manual document classification procedure that is used to curate a pertinent training dataset and then propose three classifiers: a keyword-guided method, a cluster analysis-based refined method, and a random forest approach that utilizes a large set of feature tokens. As an example, this approach is used to identify documents studying female sex workers that are assumed to contain content relevant to either HIV or violence. We compare the performance of the three classifiers by cross-validation and conduct a sensitivity analysis on the portion of data utilized in training the model. Results: The random forest approach provides the highest area under the curve (AUC) for both receiver operating characteristic (ROC) and precision/recall (PR). Analyses of precision and recall suggest that random forest could facilitate manually reviewing 20\% of the articles while containing 80\% of the relevant cases. Finally, we found a good classifier could be obtained by using a relatively small training sample size. Conclusions: In sum, the automated procedure of document classification presented here could improve both the precision and efficiency of systematic reviews, as well as facilitating live reviews, where reviews are updated regularly. △ Less

Submitted 9 December, 2020; originally announced December 2020.

arXiv:2011.14238 [pdf, other]

Approximate Cross-validated Mean Estimates for Bayesian Hierarchical Regression Models

Authors: Amy X. Zhang, Le Bao, Changcheng Li, Michael J. Daniels

Abstract: We introduce a novel procedure for obtaining cross-validated predictive estimates for Bayesian hierarchical regression models (BHRMs). Bayesian hierarchical models are popular for their ability to model complex dependence structures and provide probabilistic uncertainty estimates, but can be computationally expensive to run. Cross-validation (CV) is therefore not a common practice to evaluate the… ▽ More We introduce a novel procedure for obtaining cross-validated predictive estimates for Bayesian hierarchical regression models (BHRMs). Bayesian hierarchical models are popular for their ability to model complex dependence structures and provide probabilistic uncertainty estimates, but can be computationally expensive to run. Cross-validation (CV) is therefore not a common practice to evaluate the predictive performance of BHRMs. Our method circumvents the need to re-run computationally costly estimation methods for each cross-validation fold and makes CV more feasible for large BHRMs. By conditioning on the variance-covariance parameters, we shift the CV problem from probability-based sampling to a simple and familiar optimization problem. In many cases, this produces estimates which are equivalent to full CV. We provide theoretical results and demonstrate its efficacy on publicly available data and in simulations. △ Less

Submitted 17 January, 2024; v1 submitted 28 November, 2020; originally announced November 2020.

Comments: 26 pages, 2 figures

arXiv:2011.02055 [pdf, other]

Self-Adaptively Learning to Demoire from Focused and Defocused Image Pairs

Authors: Lin Liu, Shanxin Yuan, Jianzhuang Liu, Li** Bao, Gregory Slabaugh, Qi Tian

Abstract: Moire artifacts are common in digital photography, resulting from the interference between high-frequency scene content and the color filter array of the camera. Existing deep learning-based demoireing methods trained on large scale datasets are limited in handling various complex moire patterns, and mainly focus on demoireing of photos taken of digital displays. Moreover, obtaining moire-free gro… ▽ More Moire artifacts are common in digital photography, resulting from the interference between high-frequency scene content and the color filter array of the camera. Existing deep learning-based demoireing methods trained on large scale datasets are limited in handling various complex moire patterns, and mainly focus on demoireing of photos taken of digital displays. Moreover, obtaining moire-free ground-truth in natural scenes is difficult but needed for training. In this paper, we propose a self-adaptive learning method for demoireing a high-frequency image, with the help of an additional defocused moire-free blur image. Given an image degraded with moire artifacts and a moire-free blur image, our network predicts a moire-free clean image and a blur kernel with a self-adaptive strategy that does not require an explicit training stage, instead performing test-time adaptation. Our model has two sub-networks and works iteratively. During each iteration, one sub-network takes the moire image as input, removing moire patterns and restoring image details, and the other sub-network estimates the blur kernel from the blur image. The two sub-networks are jointly optimized. Extensive experiments demonstrate that our method outperforms state-of-the-art methods and can produce high-quality demoired results. It can generalize well to the task of removing moire artifacts caused by display screens. In addition, we build a new moire dataset, including images with screen and texture moire artifacts. As far as we know, this is the first dataset with real texture moire patterns. △ Less

Submitted 5 November, 2020; v1 submitted 3 November, 2020; originally announced November 2020.

Comments: Accepted to NeurIPS 2020. Project page: "http://home.ustc.edu.cn/~ll0825/project_FDNet.html"

arXiv:2010.05562 [pdf, other]

High-Fidelity 3D Digital Human Head Creation from RGB-D Selfies

Authors: Linchao Bao, Xiangkai Lin, Ya**g Chen, Haoxian Zhang, Sheng Wang, Xuefei Zhe, Di Kang, Haozhi Huang, Xinwei Jiang, Jue Wang, Dong Yu, Zhengyou Zhang

Abstract: We present a fully automatic system that can produce high-fidelity, photo-realistic 3D digital human heads with a consumer RGB-D selfie camera. The system only needs the user to take a short selfie RGB-D video while rotating his/her head, and can produce a high quality head reconstruction in less than 30 seconds. Our main contribution is a new facial geometry modeling and reflectance synthesis pro… ▽ More We present a fully automatic system that can produce high-fidelity, photo-realistic 3D digital human heads with a consumer RGB-D selfie camera. The system only needs the user to take a short selfie RGB-D video while rotating his/her head, and can produce a high quality head reconstruction in less than 30 seconds. Our main contribution is a new facial geometry modeling and reflectance synthesis procedure that significantly improves the state-of-the-art. Specifically, given the input video a two-stage frame selection procedure is first employed to select a few high-quality frames for reconstruction. Then a differentiable renderer based 3D Morphable Model (3DMM) fitting algorithm is applied to recover facial geometries from multiview RGB-D data, which takes advantages of a powerful 3DMM basis constructed with extensive data generation and perturbation. Our 3DMM has much larger expressive capacities than conventional 3DMM, allowing us to recover more accurate facial geometry using merely linear basis. For reflectance synthesis, we present a hybrid approach that combines parametric fitting and CNNs to synthesize high-resolution albedo/normal maps with realistic hair/pore/wrinkle details. Results show that our system can produce faithful 3D digital human faces with extremely realistic details. The main code and the newly constructed 3DMM basis is publicly available. △ Less

Submitted 29 June, 2021; v1 submitted 12 October, 2020; originally announced October 2020.

Comments: Code: https://github.com/tencent-ailab/hifi3dface

arXiv:2009.13751 [pdf, ps, other]

doi 10.1093/comjnl/bxab133

The star-structure connectivity and star-substructure connectivity of hypercubes and folded hypercubes

Authors: Lina Ba, He** Zhang

Abstract: As a generalization of vertex connectivity, for connected graphs $G$ and $T$, the $T$-structure connectivity $κ(G, T)$ (resp. $T$-substructure connectivity $κ^{s}(G, T)$) of $G$ is the minimum cardinality of a set of subgraphs $F$ of $G$ that each is isomorphic to $T$ (resp. to a connected subgraph of $T$) so that $G-F$ is disconnected. For $n$-dimensional hypercube $Q_{n}$, Lin et al. [6] showed… ▽ More As a generalization of vertex connectivity, for connected graphs $G$ and $T$, the $T$-structure connectivity $κ(G, T)$ (resp. $T$-substructure connectivity $κ^{s}(G, T)$) of $G$ is the minimum cardinality of a set of subgraphs $F$ of $G$ that each is isomorphic to $T$ (resp. to a connected subgraph of $T$) so that $G-F$ is disconnected. For $n$-dimensional hypercube $Q_{n}$, Lin et al. [6] showed $κ(Q_{n},K_{1,1})=κ^{s}(Q_{n},K_{1,1})=n-1$ and $κ(Q_{n},K_{1,r})=κ^{s}(Q_{n},K_{1,r})=\lceil\frac{n}{2}\rceil$ for $2\leq r\leq 3$ and $n\geq 3$. Sabir et al. [11] obtained that $κ(Q_{n},K_{1,4})=κ^{s}(Q_{n},K_{1,4})=\lceil\frac{n}{2}\rceil$ for $n\geq 6$, and for $n$-dimensional folded hypercube $FQ_{n}$, $κ(FQ_{n},K_{1,1})=κ^{s}(FQ_{n},K_{1,1})=n$, $κ(FQ_{n},K_{1,r})=κ^{s}(FQ_{n},K_{1,r})=\lceil\frac{n+1}{2}\rceil$ with $2\leq r\leq 3$ and $n\geq 7$. They proposed an open problem of determining $K_{1,r}$-structure connectivity of $Q_n$ and $FQ_n$ for general $r$. In this paper, we obtain that for each integer $r\geq 2$, $κ(Q_{n};K_{1,r})=κ^{s}(Q_{n};K_{1,r})=\lceil\frac{n}{2}\rceil$ and $κ(FQ_{n};K_{1,r})=κ^{s}(FQ_{n};K_{1,r})= \lceil\frac{n+1}{2}\rceil$ for all integers $n$ larger than $r$ in quare scale. For $4\leq r\leq 6$, we separately confirm the above result holds for $Q_n$ in the remaining cases. △ Less

Submitted 28 September, 2020; originally announced September 2020.

Journal ref: The Computer Journal 2021

arXiv:2008.13426 [pdf, other]

Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

Authors: Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Wei Liu, Yun-hui Liu

Abstract: This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a… ▽ More This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions about rough spatial locations to understand the visual contents. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms the existing approaches across these backbone networks on four downstream video analysis tasks including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts. △ Less

Submitted 28 January, 2021; v1 submitted 31 August, 2020; originally announced August 2020.

Comments: Accepted by TPAMI. An extension of our previous work at arXiv:1904.03597

Showing 1–50 of 81 results for author: Bao, L