Search | arXiv e-print repository

arXiv:2406.19070 [pdf, other]

FAGhead: Fully Animate Gaussian Head from Monocular Videos

Authors: Yixin Xuan, Xinyang Li, Gongxin Yao, Shiwei Zhou, Donghui Sun, Xiaoxin Chen, Yu Pan

Abstract: High-fidelity reconstruction of 3D human avatars has wide applications in virtual reality. In this paper, we introduce FAGhead, a method that enables fully controllable human portraits from monocular videos. We explicit the traditional 3D morphable meshes (3DMM) and optimize the neutral 3D Gaussians to reconstruct with complex expressions. Furthermore, we employ a novel Point-based Learnable Representation Field (PLRF) with learnable Gaussian point positions to enhance reconstruction performance. Meanwhile, to effectively manage the edges of avatars, we introduce alpha rendering to supervise the alpha value of each pixel. Extensive experimental results on open-source datasets and our own captured datasets demonstrate that our approach is able to generate high-fidelity 3D head avatars and fully control the expression and pose of the virtual avatars, outperforming existing works.

Submitted 28 June, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.00301 [pdf, other]

A Survey on the Use of Partitioning in IoT-Edge-AI Applications

Authors: Guoxing Yao, Lav Gupta

Abstract: Centralized clouds processing the large amount of data generated by Internet-of-Things (IoT) can lead to unacceptable latencies for the end user. Against this backdrop, Edge Computing (EC) is an emerging paradigm that can address the shortcomings of traditional centralized Cloud Computing (CC). Its use is associated with improved performance, productivity, and security. Some of its use cases include smart grids, healthcare Augmented Reality (AR)/Virtual Reality (VR). EC uses servers strategically placed near end users, reducing latency and proving to be particularly well-suited for time-sensitive IoT applications. It is expected to play a pivotal role in 6G and Industry 5.0. Within the IoT-edge environment, artificial intelligence (AI) plays an important role in automating decision and control, including but not limited to resource allocation activities, drawing inferences from large volumes of data, and enabling powerful security mechanisms. The use cases in the IoT-Edge-cloud environment tend to be complex resulting in large AI models, big datasets, and complex computations. This has led to researchers proposing techniques that partition data, tasks, models, or hybrid to achieve speed, efficiency, and accuracy of processing. This survey comprehensively explores the IoT-Edge-AI environment, application cases, and the partitioning techniques used. We categorize partitioning techniques and compare their performance. The survey concludes by identifying open research challenges in this domain.

Submitted 1 June, 2024; originally announced June 2024.

arXiv:2405.14074 [pdf]

Enhancing Critical Infrastructure Cybersecurity: Collaborative DNN Synthesis in the Cloud Continuum

Authors: Lav Gupta, Guoxing Yao

Abstract: Researchers are exploring the integration of IoT and the cloud continuum, together with AI to enhance the cost-effectiveness and efficiency of critical infrastructure (CI) systems. This integration, however, increases susceptibility of CI systems to cyberattacks, potentially leading to disruptions like power outages, oil spills, or even a nuclear mishap. CI systems are inherently complex and generate vast amounts of heterogeneous and high-dimensional data, which crosses many trust boundaries in their journey across the IoT, edge, and cloud domains over the communication network interconnecting them. As a result, they face expanded attack surfaces. To ensure the security of these dataflows, researchers have used deep neural network models with encouraging results. Nevertheless, two important challenges that remain are tackling the computational complexity of these models to reduce convergence times and preserving the accuracy of detection of integrity-violating intrusions. In this paper, we propose an innovative approach that utilizes trained edge cloud models to synthesize central cloud models, effectively overcoming these challenges. We empirically validate the effectiveness of the proposed method by comparing it with traditional centralized and distributed techniques, including a contemporary collaborative technique.

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.11993 [pdf, other]

GGAvatar: Geometric Adjustment of Gaussian Head Avatar

Authors: Xinyang Li, Jiaxin Wang, Yixin Xuan, Gongxin Yao, Yu Pan

Abstract: We propose GGAvatar, a novel 3D avatar representation designed to robustly model dynamic head avatars with complex identities and deformations. GGAvatar employs a coarse-to-fine structure, featuring two core modules: Neutral Gaussian Initialization Module and Geometry Morph Adjuster. Neutral Gaussian Initialization Module pairs Gaussian primitives with deformable triangular meshes, employing an adaptive density control strategy to model the geometric structure of the target subject with neutral expressions. Geometry Morph Adjuster introduces deformation bases for each Gaussian in global space, creating fine-grained low-dimensional representations of deformation behaviors to address the Linear Blend Skinning formula's limitations effectively. Extensive experiments show that GGAvatar can produce high-fidelity renderings, outperforming state-of-the-art methods in visual quality and quantitative metrics.

Submitted 20 May, 2024; originally announced May 2024.

Comments: 9 pages, 5 figures

arXiv:2402.08910 [pdf, other]

Learning-based Bone Quality Classification Method for Spinal Metastasis

Authors: Shiqi Peng, Bolin Lai, Guangyu Yao, Xiaoyun Zhang, Ya Zhang, Yan-Feng Wang, Hui Zhao

Abstract: Spinal metastasis is the most common disease in bone metastasis and may cause pain, instability and neurological injuries. Early detection of spinal metastasis is critical for accurate staging and optimal treatment. The diagnosis is usually facilitated with Computed Tomography (CT) scans, which require considerable effort from well-trained radiologists. In this paper, we explore a learning-based automatic bone quality classification method for spinal metastasis based on CT images. We simultaneously take the posterolateral spine involvement classification task into account, and employ a multi-task learning (MTL) technique to improve the performance. MTL acts as a form of inductive bias which helps the model generalize better on each task by sharing representations between related tasks. Based on the prior knowledge that the mixed type can be viewed as both blastic and lytic, we model the task of bone quality classification as two binary classification sub-tasks, i.e., whether blastic and whether lytic, and leverage a multi-layer perceptron to combine their predictions. In order to make the model more robust and generalize better, self-paced learning is adopted to gradually bring samples into the training process, from easy to more complex. The proposed learning-based method is evaluated on a proprietary spinal metastasis CT dataset. At slice level, our method significantly outperforms a 121-layer DenseNet classifier in sensitivities by +12.54%, +7.23% and +29.06% for blastic, mixed and lytic lesions, respectively, and by +12.33%, +23.21% and +34.25% at vertebrae level.

Submitted 13 February, 2024; originally announced February 2024.

arXiv:2402.08892 [pdf, other]

Weakly Supervised Segmentation of Vertebral Bodies with Iterative Slice-propagation

Authors: Shiqi Peng, Bolin Lai, Guangyu Yao, Xiaoyun Zhang, Ya Zhang, Yan-Feng Wang, Hui Zhao

Abstract: Vertebral body (VB) segmentation is an important preliminary step towards medical visual diagnosis for spinal diseases. However, most previous works require strong pixel/voxel-wise supervision, which is expensive, tedious and time-consuming for experts to annotate. In this paper, we propose a Weakly supervised Iterative Spinal Segmentation (WISS) method leveraging only four corner landmark weak labels on a single sagittal slice to achieve automatic volumetric segmentation from CT images for VBs. WISS first segments VBs on an annotated sagittal slice in an iterative self-training manner. This self-training method alternates between training and refining labels in the training set. Then WISS proceeds to segment the whole VBs slice by slice with a slice-propagation method to obtain volumetric segmentations. We evaluate the performance of WISS on a private spinal metastases CT dataset and the public lumbar CT dataset. On the first dataset, WISS achieves distinct improvements with regard to two different backbones. For the second dataset, WISS achieves dice coefficients of 91.7% and 83.7% for mid-sagittal slices and 3D CT volumes, respectively, saving a lot of labeling costs and only sacrificing a little segmentation performance.

Submitted 13 February, 2024; originally announced February 2024.

Comments: arXiv admin note: text overlap with arXiv:1412.7062 by other authors
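
Note on the metric: the Dice coefficient quoted in the WISS abstract above is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a reference mask B. The sketch below is only an illustration of that formula on flat binary masks; the function name, the toy masks, and the convention for two empty masks are assumptions made here, not details taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count positions where both masks are foreground.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Sum of the foreground sizes of the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (hypothetical): 3 foreground pixels each, 2 of them shared.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(f"Dice: {dice_coefficient(pred, target):.3f}")
```

For volumetric results such as the 3D CT figure in the abstract, the same formula would be applied to the flattened voxel masks.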

arXiv:2310.01377 [pdf, other]

UltraFeedback: Boosting Language Models with High-quality Feedback

Authors: Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, Maosong Sun

Abstract: Reinforcement learning from human feedback (RLHF) has become a pivotal technique in aligning large language models (LLMs) with human preferences. In RLHF practice, preference data plays a crucial role in bridging human proclivity and LLMs. However, the scarcity of diverse, naturalistic datasets of human preferences on LLM outputs at scale poses a great challenge to RLHF as well as feedback learning research within the open-source community. Current preference datasets, either proprietary or limited in size and prompt variety, result in limited RLHF adoption in open-source models and hinder further exploration. In this study, we propose ULTRAFEEDBACK, a large-scale, high-quality, and diversified preference dataset designed to overcome these limitations and foster RLHF development. To create ULTRAFEEDBACK, we compile a diverse array of instructions and models from multiple sources to produce comparative data. We meticulously devise annotation instructions and employ GPT-4 to offer detailed feedback in both numerical and textual forms. ULTRAFEEDBACK establishes a reproducible and expandable preference data construction pipeline, serving as a solid foundation for future RLHF and feedback learning research. Utilizing ULTRAFEEDBACK, we train various models to demonstrate its effectiveness, including the reward model UltraRM, chat language model UltraLM-13B-PPO, and critique model UltraCM. Experimental results indicate that our models outperform existing open-source models, achieving top performance across multiple benchmarks. Our data and models are available at https://github.com/thunlp/UltraFeedback.

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2307.07142 [pdf, other]

Quantity-Aware Coarse-to-Fine Correspondence for Image-to-Point Cloud Registration

Authors: Gongxin Yao, Yixin Xuan, Yiwei Chen, Yu Pan

Abstract: Image-to-point cloud registration aims to determine the relative camera pose between an RGB image and a reference point cloud, serving as a general solution for locating 3D objects from 2D observations. Matching individual points with pixels can be inherently ambiguous due to modality gaps. To address this challenge, we propose a framework to capture quantity-aware correspondences between local point sets and pixel patches and refine the results at both the point and pixel levels. This framework aligns the high-level semantics of point sets and pixel patches to improve the matching accuracy. On a coarse scale, the set-to-patch correspondence is expected to be influenced by the quantity of 3D points. To achieve this, a novel supervision strategy is proposed to adaptively quantify the degrees of correlation as continuous values. On a finer scale, point-to-pixel correspondences are refined from a smaller search space through a well-designed scheme, which incorporates both resampling and quantity-aware priors. Particularly, a confidence sorting strategy is proposed to proportionally select better correspondences at the final stage. Leveraging the advantages of high-quality correspondences, the problem is successfully resolved using an efficient Perspective-n-Point solver within the framework of random sample consensus (RANSAC). Extensive experiments on the KITTI Odometry and NuScenes datasets demonstrate the superiority of our method over the state-of-the-art methods.

Submitted 18 January, 2024; v1 submitted 13 July, 2023; originally announced July 2023.

arXiv:2302.13479 [pdf, other]

Age Minimization with Energy and Distortion Constraints

Authors: Guidan Yao, Chih-Chun Wang, Ness B. Shroff

Abstract: In this paper, we consider a status update system, where an access point collects measurements from multiple sensors that monitor a common physical process, fuses them, and transmits the aggregated sample to the destination over an erasure channel. Under a typical information fusion scheme, the distortion of the fused sample is inversely proportional to the number of measurements received. Our goal is to minimize the long-term average age while satisfying the average energy and general age-based distortion requirements. Specifically, we focus on the setting in which the distortion requirement is stricter when the age of the update is older. We show that the optimal policy is a mixture of two stationary, deterministic, threshold-based policies, each of which is optimal for a parameterized problem that aims to minimize the weighted sum of the age and energy under the distortion constraint. We then derive analytically the associated optimal average age-cost function and characterize its performance in the large threshold regime, the results of which shed critical insights on the tradeoff among age, energy, and the distortion of the samples. We have also developed a closed-form solution for the special case when the distortion requirement is independent of the age, arguably the most important setting for practical applications.

Submitted 26 February, 2023; originally announced February 2023.

arXiv:2201.02475 [pdf, other]

Deep Domain Adversarial Adaptation for Photon-efficient Imaging

Authors: Yiwei Chen, Gongxin Yao, Yong Liu, Hongye Su, Xiaomin Hu, Yu Pan

Abstract: Photon-efficient imaging with the single-photon light detection and ranging (LiDAR) captures the three-dimensional (3D) structure of a scene by only a few detected signal photons per pixel. However, the existing computational methods for photon-efficient imaging are pre-tuned on a restricted scenario or trained on simulated datasets. When applied to realistic scenarios whose signal-to-background ratios (SBR) and other hardware-specific properties differ from those of the original task, the model performance often significantly deteriorates. In this paper, we present a domain adversarial adaptation design to alleviate this domain shift problem by exploiting unlabeled real-world data, with significant resource savings. This method demonstrates superior performance on simulated and real-world experiments using our home-built up-conversion single-photon imaging system, which provides an efficient approach to bypass the lack of ground-truth depth information in implementing computational imaging algorithms for realistic applications.

Submitted 27 October, 2022; v1 submitted 7 January, 2022; originally announced January 2022.

arXiv:2201.01453 [pdf, other]

doi 10.1364/OE.452597

Robust photon-efficient imaging using a pixel-wise residual shrinkage network

Authors: Gongxin Yao, Yiwei Chen, Yong Liu, Xiaomin Hu, Yu Pan

Abstract: Single-photon light detection and ranging (LiDAR) has been widely applied to 3D imaging in challenging scenarios. However, limited signal photon counts and high noise in the collected data have posed great challenges for predicting the depth image precisely. In this paper, we propose a pixel-wise residual shrinkage network for photon-efficient imaging from high-noise data, which adaptively generates the optimal thresholds for each pixel and denoises the intermediate features by soft thresholding. Besides, redefining the optimization target as pixel-wise classification provides a sharp advantage in producing confident and accurate depth estimation when compared with existing research. Comprehensive experiments conducted on both simulated and real-world datasets demonstrate that the proposed model outperforms the state of the art and maintains robust imaging performance under different signal-to-noise ratios, including the extreme case of 1:100.

Submitted 18 May, 2022; v1 submitted 5 January, 2022; originally announced January 2022.

Journal ref: Optics Express 30(11):18856-18873, 2022
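
Note on soft thresholding: the abstract above says intermediate features are denoised by soft thresholding with a threshold per pixel. The standard soft-thresholding operator is sign(x) · max(|x| − τ, 0). The sketch below applies it element-wise with hand-picked thresholds purely for illustration; in the paper the thresholds are generated adaptively by the network, and the function names and toy values here are assumptions, not the authors' implementation.

```python
def soft_threshold(x, tau):
    """Standard soft-thresholding operator: sign(x) * max(|x| - tau, 0)."""
    if tau < 0:
        raise ValueError("threshold must be non-negative")
    magnitude = abs(x) - tau
    if magnitude <= 0:
        # Values within [-tau, tau] are treated as noise and zeroed.
        return 0.0
    # Larger values are shrunk towards zero by tau, keeping their sign.
    return magnitude if x > 0 else -magnitude


def denoise_pixelwise(features, thresholds):
    """Apply soft thresholding to a 2D feature map with one threshold per pixel.

    `features` and `thresholds` are lists of rows with matching shapes. The
    thresholds are plain inputs here (an assumption made for this sketch).
    """
    return [
        [soft_threshold(x, tau) for x, tau in zip(feature_row, tau_row)]
        for feature_row, tau_row in zip(features, thresholds)
    ]


# Hypothetical 2x2 feature map and per-pixel thresholds.
features = [[3.0, -0.2], [-4.0, 0.5]]
thresholds = [[1.0, 0.5], [1.5, 0.5]]
print(denoise_pixelwise(features, thresholds))
```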

arXiv:2112.12390 [pdf, other]

Learning Implicit Body Representations from Double Diffusion Based Neural Radiance Fields

Authors: Guangming Yao, Hongzhi Wu, Yi Yuan, Lincheng Li, Kun Zhou, Xin Yu

Abstract: In this paper, we present a novel double diffusion based neural radiance field, dubbed DD-NeRF, to reconstruct human body geometry and render the human body appearance in novel views from a sparse set of images. We first propose a double diffusion mechanism to achieve expressive representations of input images by fully exploiting human body priors and image appearance details at two levels. At the coarse level, we first model the coarse human body poses and shapes via an unclothed 3D deformable vertex model as guidance. At the fine level, we present a multi-view sampling network to capture subtle geometric deformations and detailed image appearance, such as clothing and hair, from multiple input views. Considering the sparsity of the two-level features, we diffuse them into feature volumes in the canonical space to construct neural radiance fields. Then, we present a signed distance function (SDF) regression network to construct body surfaces from the diffused features. Thanks to our double diffused representations, our method can even synthesize novel views of unseen subjects. Experiments on various datasets demonstrate that our approach outperforms the state-of-the-art in both geometric reconstruction and novel view synthesis.

Submitted 17 January, 2022; v1 submitted 23 December, 2021; originally announced December 2021.

Comments: 6 pages, 5 figures

arXiv:2105.10112 [pdf, other]

IDEAL: Independent Domain Embedding Augmentation Learning

Authors: Zhiyuan Chen, Guang Yao, Wennan Ma, Lin Xu

Abstract: Many efforts have been devoted to designing sampling, mining, and weighting strategies in high-level deep metric learning (DML) loss objectives. However, little attention has been paid to low-level but essential data transformation. In this paper, we develop a novel mechanism, the independent domain embedding augmentation learning (IDEAL) method. It can simultaneously learn multiple independent embedding spaces for multiple domains generated by predefined data transformations. Our IDEAL is orthogonal to existing DML techniques and can be seamlessly combined with prior DML approaches for enhanced performance. Empirical results on visual retrieval tasks demonstrate the superiority of the proposed method. For example, the IDEAL improves the performance of MS loss by a large margin, 84.5% → 87.1% on Cars-196, and 65.8% → 69.5% on CUB-200 at Recall@1. Our IDEAL with MS loss also achieves the new state-of-the-art performance on three image retrieval benchmarks, i.e., Cars-196, CUB-200, and SOP. It outperforms the most recent DML approaches, such as Circle loss and XBM, significantly. The source code and pre-trained models of our method will be available at https://github.com/emdata-ailab/IDEAL.

Submitted 20 May, 2021; originally announced May 2021.

Comments: 11 pages, 2 figures, 4 tables

arXiv:2105.04286 [pdf, other]

Primitive Representation Learning for Scene Text Recognition

Authors: Ruijie Yan, Liangrui Peng, Shanyu Xiao, Gang Yao

Abstract: Scene text recognition is a challenging task due to diverse variations of text instances in natural scene images. Conventional methods based on CNN-RNN-CTC or encoder-decoder with attention mechanism may not fully investigate stable and efficient feature representations for multi-oriented scene texts. In this paper, we propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images. We model elements in feature maps as the nodes of an undirected graph. A pooling aggregator and a weighted aggregator are proposed to learn primitive representations, which are transformed into high-level visual text representations by graph convolutional networks. A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding. Furthermore, by integrating visual text representations into an encoder-decoder model with the 2D attention mechanism, we propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods. Experimental results on both English and Chinese scene text recognition tasks demonstrate that PREN keeps a balance between accuracy and efficiency, while PREN2D achieves state-of-the-art performance.

Submitted 10 May, 2021; originally announced May 2021.

arXiv:2105.02039 [pdf, other]

Towards an efficient framework for Data Extraction from Chart Images

Authors: Weihong Ma, Hesuo Zhang, Shuang Yan, Guangshun Yao, Yichao Huang, Hui Li, Yaqiang Wu, Lianwen Jin

Abstract: In this paper, we fill the research gap by adopting state-of-the-art computer vision techniques for the data extraction stage in a data mining system. As shown in Fig. 1, this stage contains two subtasks, namely, plot element detection and data conversion. For building a robust box detector, we comprehensively compare different deep learning-based methods and find a suitable method to detect boxes with high precision. For building a robust point detector, a fully convolutional network with a feature fusion module is adopted, which can distinguish close points better than traditional methods. The proposed system can effectively handle various chart data without making heuristic assumptions. For data conversion, we translate the detected elements into data with semantic value. A network is proposed to measure feature similarities between legends and detected elements in the legend matching phase. Furthermore, we provide a baseline on the competition of Harvesting raw tables from Infographics. Some key factors have been found to improve the performance of each stage. Experimental results demonstrate the effectiveness of the proposed system.

Submitted 5 May, 2021; originally announced May 2021.

Comments: accepted by ICDAR2021

arXiv:2102.03984 [pdf, other]

One-shot Face Reenactment Using Appearance Adaptive Normalization

Authors: Guangming Yao, Yi Yuan, Tianjia Shao, Shuang Li, Shanqi Liu, Yong Liu, Mengmeng Wang, Kun Zhou

Abstract: The paper proposes a novel generative adversarial network for one-shot face reenactment, which can animate a single face image to a different pose-and-expression (provided by a driving image) while keeping its original appearance. The core of our network is a novel mechanism called appearance adaptive normalization, which can effectively integrate the appearance information from the input image into our face generator by modulating the feature maps of the generator using the learned adaptive parameters. Furthermore, we specially design a local net to reenact the local facial components (i.e., eyes, nose and mouth) first, which is a much easier task for the network to learn and can in turn provide explicit anchors to guide our face generator to learn the global appearance and pose-and-expression. Extensive quantitative and qualitative experiments demonstrate the significant efficacy of our model compared with prior one-shot methods.

Submitted 26 April, 2021; v1 submitted 7 February, 2021; originally announced February 2021.

Comments: 9 pages, 8 figures, 3 tables, accepted by AAAI 2021

arXiv:2012.09351 [pdf, other]

Battle between Rate and Error in Minimizing Age of Information

Authors: Guidan Yao, Ahmed M. Bedewy, Ness B. Shroff

Abstract: In this paper, we consider a status update system, in which update packets are sent to the destination via a wireless medium that allows for multiple rates, where a higher rate also naturally corresponds to a higher error probability. The data freshness is measured using age of information, which is defined as the age of the most recent update at the destination. A packet that is transmitted with a higher rate will encounter a shorter delay and a higher error probability. Thus, the choice of the transmission rate affects the age at the destination. In this paper, we design a low-complexity scheduler that selects between two different transmission rate and error probability pairs to be used at each transmission epoch. This problem can be cast as a Markov Decision Process. We show that there exists a threshold-type policy that is age-optimal. More importantly, we show that the objective function is quasi-convex or non-decreasing in the threshold, depending on the system parameter values. This enables us to devise a low-complexity algorithm to minimize the age. These results reveal an interesting phenomenon: while choosing the rate with minimum mean delay is delay-optimal, this does not necessarily minimize the age.

Submitted 28 December, 2020; v1 submitted 16 December, 2020; originally announced December 2020.

arXiv:2012.02958 [pdf, other]

Age-Optimal Low-Power Status Update over Time-Correlated Fading Channel

Authors: Guidan Yao, Ahmed M. Bedewy, Ness B. Shroff

Abstract: In this paper, we consider transmission scheduling in a status update system, where updates are generated periodically and transmitted over a Gilbert-Elliott fading channel. The goal is to minimize the long-run average age of information (AoI) at the destination under an average energy constraint. We consider two practical cases to obtain channel state information (CSI): (i) \emph{without channel sensing} and (ii) \emph{with delayed channel sensing}. For case (i), the channel state is revealed when an ACK/NACK is received at the transmitter following a transmission, but when no transmission occurs, the channel state is not revealed. Thus, we have to design schemes that balance tradeoffs across energy, AoI, channel exploration, and channel exploitation. The problem is formulated as a constrained partially observable Markov decision process (POMDP). To reduce algorithm complexity, we show that the optimal policy is a randomized mixture of no more than two stationary deterministic policies, each of which is of a threshold type in the belief on the channel. For case (ii), (delayed) CSI is available at the transmitter via channel sensing. In this case, the tradeoff is only between the AoI and energy consumption, and the problem is formulated as a constrained MDP. The optimal policy is shown to have a similar structure as in case (i) but with an AoI-associated threshold. Finally, the performance of the proposed structure-aware algorithms is evaluated numerically and compared with a Greedy policy.

Submitted 31 January, 2021; v1 submitted 5 December, 2020; originally announced December 2020.

arXiv:2008.07783 [pdf, other]

doi 10.1145/3394171.3413865

Mesh Guided One-shot Face Reenactment using Graph Convolutional Networks

Authors: Guangming Yao, Yi Yuan, Tianjia Shao, Kun Zhou

Abstract: Face reenactment aims to animate a source face image to a different pose and expression provided by a driving image. Existing approaches are either designed for a specific identity, or suffer from the identity preservation problem in the one-shot or few-shot scenarios. In this paper, we introduce a method for one-shot face reenactment, which uses the reconstructed 3D meshes (i.e., the source mesh and driving mesh) as guidance to learn the optical flow needed for the reenacted face synthesis. Technically, we explicitly exclude the driving face's identity information in the reconstructed driving mesh. In this way, our network can focus on the motion estimation for the source face without the interference of driving face shape. We propose a motion net to learn the face motion, which is an asymmetric autoencoder. The encoder is a graph convolutional network (GCN) that learns a latent motion vector from the meshes, and the decoder serves to produce an optical flow image from the latent vector with CNNs. Compared to previous methods using sparse keypoints to guide the optical flow learning, our motion net learns the optical flow directly from 3D dense meshes, which provide the detailed shape and pose information for the optical flow, so it can achieve more accurate expression and pose on the reenacted face. Extensive experiments show that our method can generate high-quality results and outperforms state-of-the-art methods in both qualitative and quantitative comparisons.

Submitted 18 September, 2020; v1 submitted 18 August, 2020; originally announced August 2020.

Comments: 9 pages, 8 figures, accepted by ACM MM 2020

arXiv:2004.05233 [pdf, other]

Shape Estimation for Elongated Deformable Object using B-spline Chained Multiple Random Matrices Model

Authors: Gang Yao, Ryan Saltus, Ashwin Dani

Abstract: In this paper, a B-spline chained multiple random matrices representation is proposed to model geometric characteristics of an elongated deformable object. The hyper degrees of freedom structure of the elongated deformable object makes its shape estimation challenging. Based on the likelihood function of the proposed model, an expectation-maximization (EM) method is derived to estimate the shape of the elongated deformable object. A split and merge method based on the Euclidean minimum spanning tree (EMST) is proposed to provide initialization for the EM algorithm. The proposed algorithm is evaluated for the shape estimation of elongated deformable objects in scenarios such as a static rope with various configurations (including configurations with intersection), the continuous manipulation of a rope and a plastic tube, and the assembly of two plastic tubes. The execution time is computed, and the accuracy of the shape estimation results is evaluated based on comparisons between the estimated width values and their ground truth, and on the intersection over union (IoU) metric.

Submitted 10 April, 2020; originally announced April 2020.

arXiv:2003.09615 [pdf, other]

DP-Net: Dynamic Programming Guided Deep Neural Network Compression

Authors: Dingcheng Yang, Wenjian Yu, Ao Zhou, Haoyuan Mu, Gary Yao, Xiaoyi Wang

Abstract: In this work, we propose an effective scheme (called DP-Net) for compressing deep neural networks (DNNs). It includes a novel dynamic programming (DP) based algorithm to obtain the optimal solution of weight quantization and an optimization process to train a clustering-friendly DNN. Experiments showed that DP-Net allows larger compression than the state-of-the-art counterparts while preserving accuracy. The largest compression ratio, 77X on Wide ResNet, is achieved by combining DP-Net with other compression techniques. Furthermore, DP-Net is extended to compress a robust DNN model with negligible accuracy loss. Finally, a custom accelerator is designed on FPGA to speed up the inference computation with DP-Net.

Submitted 21 March, 2020; originally announced March 2020.

Comments: 7 pages, 4 figures

arXiv:2003.00835

Deep Variational Luenberger-type Observer for Stochastic Video Prediction

Authors: Dong Wang, Feng Zhou, Zheng Yan, Guang Yao, Zongxuan Liu, Wennan Ma, Cewu Lu

Abstract: Considering the inherent stochasticity and uncertainty, predicting future video frames is exceptionally challenging. In this work, we study the problem of video prediction by combining the interpretability of stochastic state space models with the representation learning of deep neural networks. Our model builds upon a variational encoder, which transforms the input video into a latent feature space, and a Luenberger-type observer, which captures the dynamic evolution of the latent features. This enables the decomposition of videos into static features and dynamics in an unsupervised manner. By deriving the stability theory of the nonlinear Luenberger-type observer, the hidden states in the feature space become insensitive with respect to the initial values, which improves the robustness of the overall model. Furthermore, the variational lower bound on the data log-likelihood can be derived to obtain the tractable posterior prediction distribution based on the variational principle. Finally, experiments on datasets such as Bouncing Balls and Pendulum are provided to demonstrate that the proposed model outperforms concurrent works.

Submitted 10 September, 2023; v1 submitted 12 February, 2020; originally announced March 2020.

Comments: rewrite paper

arXiv:1912.01054 [pdf, other]

The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 Challenge

Authors: Nicholas Heller, Fabian Isensee, Klaus H. Maier-Hein, Xiaoshuai Hou, Chunmei Xie, Fengyi Li, Yang Nan, Guangrui Mu, Zhiyong Lin, Miofei Han, Guang Yao, Yaozong Gao, Yao Zhang, Yixin Wang, Feng Hou, Jiawei Yang, Guangwei Xiong, Jiang Tian, Cheng Zhong, Jun Ma, Jack Rickman, Joshua Dean, Bethany Stai, Resha Tejpaul, Makinna Oestreich , et al. (16 additional authors not shown)

Abstract: There is a large body of literature linking anatomic and geometric characteristics of kidney tumors to perioperative and oncologic outcomes. Semantic segmentation of these tumors and their host kidneys is a promising tool for quantitatively characterizing these lesions, but its adoption is limited due to the manual effort required to produce high-quality 3D segmentations of these structures. Recently, methods based on deep learning have shown excellent results in automatic 3D segmentation, but they require large datasets for training, and there remains little consensus on which methods perform best. The 2019 Kidney and Kidney Tumor Segmentation challenge (KiTS19) was a competition held in conjunction with the 2019 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) which sought to address these issues and stimulate progress on this automatic segmentation problem. A training set of 210 cross sectional CT images with kidney tumors was publicly released with corresponding semantic segmentation masks. 106 teams from five continents used this data to develop automated systems to predict the true segmentation masks on a test set of 90 CT images for which the corresponding ground truth segmentations were kept private. These predictions were scored and ranked according to their average Sørensen-Dice coefficient between the kidney and tumor across all 90 cases. The winning team achieved a Dice of 0.974 for kidney and 0.851 for tumor, approaching the inter-annotator performance on kidney (0.983) but falling short on tumor (0.923). This challenge has now entered an "open leaderboard" phase where it serves as a challenging benchmark in 3D semantic segmentation.
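For readers unfamiliar with the metric named in this abstract, the following is a minimal sketch of the standard Sørensen-Dice coefficient, 2|A ∩ B| / (|A| + |B|), on binary masks. It is not the challenge's evaluation code: the function name `dice_coefficient`, the flat 0/1 list representation, and the convention of returning 1.0 for two empty masks are all assumptions made for illustration.

```python
def dice_coefficient(pred, truth):
    """Sørensen-Dice coefficient of two binary masks (flat sequences of 0/1).

    Hypothetical helper for illustration; not taken from the KiTS19 tooling.
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(dice_coefficient(predicted, reference))
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no foreground voxel, which is the scale on which the reported 0.974 (kidney) and 0.851 (tumor) should be read.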

Submitted 7 August, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

Comments: 24 pages, 11 figures

arXiv:1911.01002 [pdf, other]

Generalized NLFSR Transformation Algorithms and Cryptanalysis of the Class of Espresso-like Stream Ciphers

Authors: Ge Yao, Udaya Parampalli

Abstract: Lightweight stream ciphers are in high demand in IoT applications. In order to optimize the hardware performance, a new class of stream cipher has been proposed. The basic idea is to employ a single Galois NLFSR with maximum period to construct the cipher. As a representative design of this kind of stream cipher, Espresso is based on a 256-bit Galois NLFSR initialized by a 128-bit key. The $2^{256}-1$ maximum period is assured because the Galois NLFSR is transformed from a maximum length LFSR. However, we propose a Galois-to-Fibonacci transformation algorithm and successfully transform the Galois NLFSR into a Fibonacci LFSR with a nonlinear output function. The transformed cipher is broken by the standard algebraic attack and the Rønjom-Helleseth attack with complexity $\mathcal{O}(2^{68.44})$ and $\mathcal{O}(2^{66.86})$ respectively. The transformation algorithm is derived from a new Fibonacci-to-Galois transformation algorithm we propose in this paper. Compared to existing algorithms, the proposed algorithms are more efficient and cover more general use cases. Moreover, the transformation result shows that the Galois NLFSR used in any Espresso-like stream cipher can be easily transformed back into the original Fibonacci LFSR. Therefore, this kind of design should be avoided in the future.
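The abstract's remark that a maximum-length LFSR on n bits has period $2^{n}-1$ can be checked on a toy register. The sketch below is an illustration only: it uses a 4-bit Fibonacci LFSR with feedback polynomial x^4 + x + 1 (a primitive polynomial over GF(2)), not Espresso's 256-bit register, and the function name `lfsr_period` and its parameters are hypothetical.

```python
def lfsr_period(seed, nbits=4, taps=(0, 1)):
    """Count steps until a Fibonacci LFSR returns to its seed state.

    Toy illustration: taps (0, 1) on 4 bits implement the recurrence
    a[t+4] = a[t] XOR a[t+1], i.e. the polynomial x^4 + x + 1.
    Bit 0 must be a tap so that the state update is invertible.
    """
    state = seed
    steps = 0
    while True:
        # Feedback bit is the XOR of the tapped state bits.
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        # Shift right by one and insert the feedback bit at the top.
        state = (state >> 1) | (feedback << (nbits - 1))
        steps += 1
        if state == seed:
            return steps


if __name__ == "__main__":
    # Every nonzero seed lies on the single cycle of length 2**4 - 1.
    print(lfsr_period(0b0001))
```

The all-zero state is the one excluded state: it maps to itself, which is why the maximum period is $2^{n}-1$ rather than $2^{n}$.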

Submitted 3 November, 2019; originally announced November 2019.

arXiv:1901.00963 [pdf, other]

Integrating Sub-6 GHz and Millimeter Wave to Combat Blockage: Delay-Optimal Scheduling

Authors: Guidan Yao, Morteza Hashemi, Ness B. Shroff

Abstract: Millimeter wave (mmWave) technologies have the potential to achieve very high data rates, but suffer from intermittent connectivity. In this paper, we provision an architecture to integrate sub-6 GHz and mmWave technologies, where we incorporate the sub-6 GHz interface as a fallback data transfer mechanism to combat blockage and intermittent connectivity of the mmWave communications. To this end, we investigate the problem of scheduling data packets across the mmWave and sub-6 GHz interfaces such that the average delay of the system is minimized. This problem can be formulated as a Markov Decision Process. We first investigate the problem of discounted delay minimization, and prove that the optimal policy is of the threshold type, i.e., data packets should always be routed to the mmWave interface as long as the number of packets in the system is smaller than a threshold. Then, we show that the results of the discounted delay problem hold for the average delay problem as well. Through numerical results, we demonstrate that under heavy traffic, integrating sub-6 GHz with mmWave can reduce the average delay by up to 70%. Further, our scheduling policy substantially reduces the delay over the celebrated MaxWeight policy.

Submitted 21 January, 2019; v1 submitted 3 January, 2019; originally announced January 2019.

arXiv:1804.03036 [pdf, other]

Image Moment Models for Extended Object Tracking

Authors: Gang Yao, Ashwin Dani

Abstract: In this paper, a novel image moments based model for shape estimation and tracking of an object moving with a complex trajectory is presented. The camera is assumed to be stationary looking at a moving object. Point features inside the object are sampled as measurements. An ellipsoidal approximation of the shape is assumed as a primitive shape. The shape of an ellipse is estimated using a combination of image moments. A dynamic model of image moments when the object moves under the constant velocity or coordinated turn motion model is derived as a function for the shape estimation of the object. An Unscented Kalman Filter-Interacting Multiple Model (UKF-IMM) filter algorithm is applied to estimate the shape of the object (approximated as an ellipse) and track its position and velocity. A likelihood function based on average log-likelihood is derived for the IMM filter. Simulation results of the proposed UKF-IMM algorithm with the image moments based models are presented that show the estimations of the shape of the object moving in complex trajectories. Comparison results, using intersection over union (IOU), and position and velocity root mean square errors (RMSE) as metrics, with a benchmark algorithm from literature are presented. Results on real image data captured from a quadcopter are also presented.

Submitted 9 April, 2018; originally announced April 2018.

Journal ref: IEEE Transactions on Aerospace and Electronic Systems, 2018

arXiv:1804.02470 [pdf, other]

Visual Tracking Using Sparse Coding and Earth Mover's Distance

Authors: Gang Yao, Ashwin Dani

Abstract: An efficient iterative Earth Mover's Distance (iEMD) algorithm for visual tracking is proposed in this paper. The Earth Mover's Distance (EMD) is used as the similarity measure to search for the optimal template candidates in feature-spatial space in a video sequence. The computation of the EMD is formulated as the transportation problem from linear programming. The efficiency of the EMD optimization problem limits its use for visual tracking. To alleviate this problem, a transportation-simplex method is used for EMD optimization and a monotonically convergent iterative optimization algorithm is developed. The local sparse representation is used as the appearance model for the iEMD tracker. The maximum-alignment-pooling method is used for constructing a sparse coding histogram which reduces the computational complexity of the EMD optimization. The template update algorithm based on the EMD is also presented. The iEMD tracking algorithm assumes small inter-frame movement in order to guarantee convergence. When the camera is mounted on a moving robot, e.g., a flying quadcopter, the camera could experience a sudden and rapid motion leading to large inter-frame movements. To ensure that the tracking algorithm converges, a gyro-aided extension of the iEMD tracker is presented, where synchronized gyroscope information is utilized to compensate for the rotation of the camera. The iEMD algorithm's performance is evaluated using eight publicly available datasets. The performance of the iEMD algorithm is compared with seven state-of-the-art tracking algorithms based on relative percentage overlap. The robustness of this algorithm for large inter-frame displacements is also illustrated.

Submitted 6 April, 2018; originally announced April 2018.

Showing 1–27 of 27 results for author: Yao, G