Search | arXiv e-print repository

Cross-Slice Attention and Evidential Critical Loss for Uncertainty-Aware Prostate Cancer Detection

Authors: Alex Ling Yu Hung, Haoxin Zheng, Kai Zhao, Kaifeng Pang, Demetri Terzopoulos, Kyunghyun Sung

Abstract: Current deep learning-based models typically analyze medical images in either 2D or 3D albeit disregarding volumetric information or suffering sub-optimal performance due to the anisotropic resolution of MR data. Furthermore, providing an accurate uncertainty estimation is beneficial to clinicians, as it indicates how confident a model is about its prediction. We propose a novel 2.5D cross-slice attention model that utilizes both global and local information, along with an evidential critical loss, to perform evidential deep learning for the detection in MR images of prostate cancer, one of the most common cancers and a leading cause of cancer-related death in men. We perform extensive experiments with our model on two different datasets and achieve state-of-the-art performance in prostate cancer detection along with improved epistemic uncertainty estimation. The implementation of the model is available at https://github.com/aL3x-O-o-Hung/GLCSA_ECLoss.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2405.07478 [pdf, other]

Coded Event-triggered Control for Nonlinear Systems

Authors: Ruihang Ji, Shuzhi Sam Ge, Kai Zhao

Abstract: This paper studies a Coded Event-triggered Control (CEC) for a class of nonlinear systems under any initial condition. To reduce the communication burden, the CEC is designed from the encoding-decoding viewpoint, by which only an $m$-length string is transmitted for each communication between the CEC and the actuator. If a more general Entry Capture Problem is encountered, where the performance constraints are satisfied some time after (rather than from the beginning of) system operation, such control design becomes complicated and challenging, rendering the commonly employed prescribed performance control invalid because it may not be defined in the initial interval. By introducing auxiliary functions, we develop a Self-adjustable Prescribed Performance (SPP) mechanism which can flexibly adjust the symmetric or asymmetric performance boundaries to accommodate different initial conditions, providing an effective solution for the underlying tracking problem. In this way, the resulting CEC can not only consume fewer communication resources but also regulate the tracking error under any initial condition into an allowable set before a given time in a bounded and customizable manner. Simulation results verify and clarify the theoretical findings.

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2403.09651 [pdf, other]

Precision Agriculture: Crop Mapping using Machine Learning and Sentinel-2 Satellite Imagery

Authors: Kui Zhao, Siyang Wu, Chang Liu, Yue Wu, Natalia Efremova

Abstract: Food security has grown in significance due to the changing climate and its warming effects. To support the rising demand for agricultural products and to minimize the negative impact of climate change and mass cultivation, precision agriculture has become increasingly important for crop cultivation. This study employs deep learning and pixel-based machine learning methods to accurately segment lavender fields for precision agriculture, utilizing various spectral band combinations extracted from Sentinel-2 satellite imagery. Our fine-tuned final model, a U-Net architecture, can achieve a Dice coefficient of 0.8324. Additionally, our investigation highlights the unexpected efficacy of the pixel-based method and the RGB spectral band combination in this task.

Submitted 25 November, 2023; originally announced March 2024.
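
The crop-mapping abstract above reports a Dice coefficient of 0.8324. As a reminder of what that number measures, here is a minimal sketch of the standard Dice formula, 2|P ∩ T| / (|P| + |T|), for two binary masks. The function name, the flat-list mask representation, and the toy masks are illustrative assumptions and are not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two flat binary masks.

    `pred` and `target` are sequences of 0/1 values of equal length
    (a hypothetical, flattened representation of segmentation masks).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked positive in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total number of positive pixels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (a common convention).
        return 1.0
    return 2.0 * intersection / total


# Toy example: 2 overlapping pixels, 3 positives in each mask.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred_mask, true_mask)
print(round(score, 4))
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no positive pixel.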

arXiv:2403.03629 [pdf, other]

Spatially Selective Reconfigurable Intelligent Surfaces Through Element Permutation

Authors: Fredrik Rusek, Jose Flordelis, Kun Zhao, Erik Bengtsson, Olof Zander

Abstract: A standard reconfigurable intelligent surface (RIS) can be configured to reflect signals from an arbitrary impinging direction to an arbitrary outgoing direction. However, if a signal impinges from any other direction, said signal is reflected, with full beamforming gain, to a specific direction, which is easily determined. The goal of this paper is to propose a RIS which \emph{only} reflects signals from the configured impinging direction. This can be accomplished by a RIS architecture that permutes the antenna elements in the sense that a signal is re-radiated from a different antenna than the one receiving the signal. We analytically prove this fact, and also discuss several variants and hardware implementations.

Submitted 6 March, 2024; originally announced March 2024.

Comments: ICC 2024, 6 pages, 4 figures

arXiv:2402.14349 [pdf, other]

Uncertainty-driven and Adversarial Calibration Learning for Epicardial Adipose Tissue Segmentation

Authors: Kai Zhao, Zhiming Liu, Jiaqi Liu, Jingbiao Zhou, Bihong Liao, Huifang Tang, Qiuyu Wang, Chunquan Li

Abstract: Epicardial adipose tissue (EAT) is a type of visceral fat that can secrete large amounts of adipokines to affect the myocardium and coronary arteries. EAT volume and density can be used as independent risk markers; measurement of volume by noninvasive magnetic resonance images is the best method of assessing EAT. However, segmenting EAT is challenging due to the low contrast between EAT and pericardial effusion and the presence of motion artifacts. We propose a novel feature latent space multilevel supervision network (SPDNet) with uncertainty-driven and adversarial calibration learning to enhance segmentation for more accurate EAT volume estimation. The network first addresses the blurring of EAT edges caused by low-quality or out-of-distribution medical images in open medical environments by modeling the uncertainty as a Gaussian distribution in the feature latent space, using its Bayesian estimation as a regularization constraint to optimize SwinUNETR. Second, an adversarial training strategy is introduced to calibrate the segmentation feature map and consider the multi-scale feature differences between the uncertainty-guided predictive segmentation and the ground truth segmentation; synthesizing the multi-scale adversarial loss directly improves the ability to discriminate the similarity between organizations.
Experiments on both the cardiac public MRI dataset (ACDC) and the real-world clinical cohort EAT dataset show that the proposed network outperforms mainstream models, validating that uncertainty-driven and adversarial calibration learning can be used to provide additional information for modeling multi-scale ambiguities.

Submitted 23 February, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: 13 pages, 7 figures

arXiv:2312.10921 [pdf, other]

AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis

Authors: Dongze Li, Kang Zhao, Wei Wang, Bo Peng, Yingya Zhang, Jing Dong, Tieniu Tan

Abstract: Audio-driven talking head synthesis is a promising topic with wide applications in digital human, film making and virtual reality. Recent NeRF-based approaches have shown superiority in quality and fidelity compared to previous studies. However, when it comes to few-shot talking head generation, a practical scenario where only few seconds of talking video is available for one identity, two limitations emerge: 1) they either have no base model, which serves as a facial prior for fast convergence, or ignore the importance of audio when building the prior; 2) most of them overlook the degree of correlation between different face regions and audio, e.g., mouth is audio related, while ear is audio independent. In this paper, we present Audio Enhanced Neural Radiance Field (AE-NeRF) to tackle the above issues, which can generate realistic portraits of a new speaker with fewshot dataset. Specifically, we introduce an Audio Aware Aggregation module into the feature fusion stage of the reference scheme, where the weight is determined by the similarity of audio between reference and target image. Then, an Audio-Aligned Face Generation strategy is proposed to model the audio related and audio independent regions respectively, with a dual-NeRF framework. Extensive experiments have shown AE-NeRF surpasses the state-of-the-art on image fidelity, audio-lip synchronization, and generalization ability, even in limited training set or training iterations.

Submitted 17 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI 2024

arXiv:2311.04942 [pdf, other]

CSAM: A 2.5D Cross-Slice Attention Module for Anisotropic Volumetric Medical Image Segmentation

Authors: Alex Ling Yu Hung, Haoxin Zheng, Kai Zhao, Xiaoxi Du, Kaifeng Pang, Qi Miao, Steven S. Raman, Demetri Terzopoulos, Kyunghyun Sung

Abstract: A large portion of volumetric medical data, especially magnetic resonance imaging (MRI) data, is anisotropic, as the through-plane resolution is typically much lower than the in-plane resolution. Both 3D and purely 2D deep learning-based segmentation methods are deficient in dealing with such volumetric data since the performance of 3D methods suffers when confronting anisotropic data, and 2D methods disregard crucial volumetric information. Insufficient work has been done on 2.5D methods, in which 2D convolution is mainly used in concert with volumetric information. These models focus on learning the relationship across slices, but typically have many parameters to train. We offer a Cross-Slice Attention Module (CSAM) with minimal trainable parameters, which captures information across all the slices in the volume by applying semantic, positional, and slice attention on deep feature maps at different scales. Our extensive experiments using different network architectures and tasks demonstrate the usefulness and generalizability of CSAM. Associated code is available at https://github.com/aL3x-O-o-Hung/CSAM.

Submitted 26 November, 2023; v1 submitted 7 November, 2023; originally announced November 2023.

arXiv:2310.19022 [pdf, other]

doi 10.1109/TCYB.2023.3323316

Optimization Landscape of Policy Gradient Methods for Discrete-time Static Output Feedback

Authors: Jingliang Duan, Jie Li, Xuyang Chen, Kai Zhao, Shengbo Eben Li, Lin Zhao

Abstract: In recent times, significant advancements have been made in delving into the optimization landscape of policy gradient methods for achieving optimal control in linear time-invariant (LTI) systems. Compared with state-feedback control, output-feedback control is more prevalent since the underlying state of the system may not be fully observed in many practical settings. This paper analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback (SOF) control in discrete-time LTI systems subject to quadratic cost. We begin by establishing crucial properties of the SOF cost, encompassing coercivity, L-smoothness, and M-Lipschitz continuous Hessian. Despite the absence of convexity, we leverage these properties to derive novel findings regarding convergence (and nearly dimension-free rate) to stationary points for three policy gradient methods, including the vanilla policy gradient method, the natural policy gradient method, and the Gauss-Newton method. Moreover, we provide proof that the vanilla policy gradient method exhibits linear convergence towards local minima when initialized near such minima. The paper concludes by presenting numerical examples that validate our theoretical findings. These results not only characterize the performance of gradient descent for optimizing the SOF problem but also provide insights into the effectiveness of general policy gradient methods within the realm of reinforcement learning.

Submitted 29 October, 2023; originally announced October 2023.

Journal ref: IEEE Transactions on Cybernetics, 2023

arXiv:2307.11926 [pdf, other]

PartDiff: Image Super-resolution with Partial Diffusion Models

Authors: Kai Zhao, Alex Ling Yu Hung, Kaifeng Pang, Haoxin Zheng, Kyunghyun Sung

Abstract: Denoising diffusion probabilistic models (DDPMs) have achieved impressive performance on various image generation tasks, including image super-resolution. By learning to reverse the process of gradually diffusing the data distribution into Gaussian noise, DDPMs generate new data by iteratively denoising from random noise. Despite their impressive performance, diffusion-based generative models suffer from high computational costs due to the large number of denoising steps. In this paper, we first observed that the intermediate latent states gradually converge and become indistinguishable when diffusing a pair of low- and high-resolution images. This observation inspired us to propose the Partial Diffusion Model (PartDiff), which diffuses the image to an intermediate latent state instead of pure random noise, where the intermediate latent state is approximated by the latent of diffusing the low-resolution image. During generation, Partial Diffusion Models start denoising from the intermediate distribution and perform only a part of the denoising steps. Additionally, to mitigate the error caused by the approximation, we introduce "latent alignment", which aligns the latent between low- and high-resolution images during training. Experiments on both magnetic resonance imaging (MRI) and natural images show that, compared to plain diffusion-based super-resolution methods, Partial Diffusion Models significantly reduce the number of denoising steps without sacrificing the quality of generation.

Submitted 21 July, 2023; originally announced July 2023.

arXiv:2307.09729 [pdf, other]

NTIRE 2023 Quality Assessment of Video Enhancement Challenge

Authors: Xiaohong Liu, Xiongkuo Min, Wei Sun, Yulun Zhang, Kai Zhang, Radu Timofte, Guangtao Zhai, Yixuan Gao, Yuqin Cao, Tengchuan Kou, Yunlong Dong, Ziheng Jia, Yilin Li, Wei Wu, Shuming Hu, Sibin Deng, Pengxiang Xiao, Ying Chen, Kai Li, Kai Zhao, Kun Yuan, Ming Sun, Heng Cong, Hao Wang, Lingzhi Fu , et al. (47 additional authors not shown)

Abstract: This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual Video Enhancement (VDPVE), which has a total of 1211 enhanced videos, including 600 videos with color, brightness, and contrast enhancements, 310 videos with deblurring, and 301 deshaked videos. The challenge has a total of 167 registered participants. 61 participating teams submitted their prediction results during the development phase, with a total of 3168 submissions. A total of 176 submissions were submitted by 37 participating teams during the final testing phase. Finally, 19 participating teams submitted their models and fact sheets, and detailed the methods they used. Some methods have achieved better results than baseline methods, and the winning methods have demonstrated superior prediction performance.

Submitted 18 July, 2023; originally announced July 2023.

arXiv:2306.02011 [pdf]

The contribution of T2 relaxation time to diffusion MRI quantification and its clinical implications: a hypothesis

Authors: Yi Xiang J Wang, Kai-Xuan Zhao, Fu-Zhao Ma, Ben-Heng Xiao

Abstract: Considering liver as the reference, that both fast diffusion (PF) and slow diffusion (Dslow) of the spleen are much underestimated is likely due to the MRI properties of the spleen such as the much longer T2 relaxation time. It is possible that longer T2 relaxation time partially mitigates the signal decay effect of various gradients on diffusion weighted image. This phenomenon will not be limited to the spleen. Most liver tumors have a longer T2 relaxation time than their native normal tissue and this is considered to be associated with oedema. On the other hand, most tumors are measured with lower MRI diffusion (despite being oedematous). The reason why malignant tumors have lower diffusion values [apparent diffusion coefficient (ADC) and Dslow] is poorly understood but has been proposed to be related to a combination of higher cellularity, tissue disorganization, and increased extracellular space tortuosity. These explanations may be true, but it is also possible that many tumors have MRI properties similar to the spleen such as longer T2 (relative to the liver) and these MRI properties may also contribute to the lower MRI measured ADC and Dslow. In other words, if we could hypothetically plant a piece of spleen tissue in the liver, MRI would recognize this planted spleen tissue as being similar to a tumor and measure it to have lower diffusion than the liver.

Submitted 3 June, 2023; originally announced June 2023.

arXiv:2305.00139 [pdf, other]

Leveraging Label Non-Uniformity for Node Classification in Graph Neural Networks

Authors: Feng Ji, See Hian Lee, Hanyang Meng, Kai Zhao, Jielong Yang, Wee Peng Tay

Abstract: In node classification using graph neural networks (GNNs), a typical model generates logits for different class labels at each node. A softmax layer often outputs a label prediction based on the largest logit. We demonstrate that it is possible to infer hidden graph structural information from the dataset using these logits. We introduce the key notion of label non-uniformity, which is derived from the Wasserstein distance between the softmax distribution of the logits and the uniform distribution. We demonstrate that nodes with small label non-uniformity are harder to classify correctly. We theoretically analyze how the label non-uniformity varies across the graph, which provides insights into boosting the model performance: increasing training samples with high non-uniformity or dropping edges to reduce the maximal cut size of the node set of small non-uniformity. These mechanisms can be easily added to a base GNN model. Experimental results demonstrate that our approach improves the performance of many benchmark base models.

Submitted 28 April, 2023; originally announced May 2023.
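
The label non-uniformity abstract above describes a score derived from the Wasserstein distance between a node's softmax distribution and the uniform distribution. The abstract does not state the ground metric on the label set, so the sketch below assumes the discrete 0/1 metric, under which the Wasserstein-1 distance reduces to the total variation distance, 0.5 * sum_i |p_i - 1/n|. The function names and example logits are hypothetical, and the paper's exact definition may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def label_non_uniformity(logits):
    """Distance between softmax(logits) and the uniform distribution.

    Assumption: labels carry the discrete 0/1 metric, so the
    Wasserstein-1 distance equals the total variation distance.
    """
    probs = softmax(logits)
    n = len(probs)
    return 0.5 * sum(abs(p - 1.0 / n) for p in probs)


# Hypothetical per-node logits for a 3-class problem.
confident_node = [10.0, 0.0, 0.0]   # softmax mass concentrated on one class
ambiguous_node = [0.1, 0.0, -0.1]   # softmax close to uniform

print(label_non_uniformity(confident_node))
print(label_non_uniformity(ambiguous_node))
```

Under this assumption the score ranges from 0 (exactly uniform output) to 1 - 1/n (all mass on one class), so the ambiguous node scores lower, in line with the abstract's claim that nodes with small non-uniformity are harder to classify.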

arXiv:2304.04366 [pdf, other]

Learning Residual Model of Model Predictive Control via Random Forests for Autonomous Driving

Authors: Kang Zhao, Jianru Xue, Xiangning Meng, Gengxin Li, Mengsen Wu

Abstract: One major issue in learning-based model predictive control (MPC) for autonomous driving is the contradiction between the system model's prediction accuracy and computation efficiency. The more situations a system model covers, the more complex it is, along with highly nonlinear and nonconvex properties. These issues make the optimization too complicated to solve and render real-time control impractical. To address these issues, we propose a hierarchical learning residual model which leverages random forests and linear regression. The learned model consists of two levels. The low level uses linear regression to fit the residues, and the high level uses random forests to switch different linear models. Meanwhile, we adopt the linear dynamic bicycle model with error states as the nominal model. The switched linear regression model is added to the nominal model to form the system model. It reformulates the learning-based MPC as a quadratic program (QP) problem and optimization solvers can effectively solve it. Experimental path tracking results show that the driving vehicle's prediction accuracy and tracking accuracy are significantly improved compared with the nominal MPC. Compared with the state-of-the-art Gaussian process-based nonlinear model predictive control (GP-NMPC), our method gets better performance on tracking accuracy while maintaining a lower computation consumption.

Submitted 9 April, 2023; originally announced April 2023.

Comments: 8 pages, 8 figures

arXiv:2304.03507 [pdf, other]

Distributional Signals for Node Classification in Graph Neural Networks

Authors: Feng Ji, See Hian Lee, Kai Zhao, Wee Peng Tay, Jielong Yang

Abstract: In graph neural networks (GNNs), both node features and labels are examples of graph signals, a key notion in graph signal processing (GSP). While it is common in GSP to impose signal smoothness constraints in learning and estimation tasks, it is unclear how this can be done for discrete node labels. We bridge this gap by introducing the concept of distributional graph signals. In our framework, we work with the distributions of node labels instead of their values and propose notions of smoothness and non-uniformity of such distributional graph signals. We then propose a general regularization method for GNNs that allows us to encode distributional smoothness and non-uniformity of the model output in semi-supervised node classification tasks. Numerical experiments demonstrate that our method can significantly improve the performance of most base GNN models in different problem settings.

Submitted 7 April, 2023; originally announced April 2023.

arXiv:2301.05648 [pdf, ps, other]

Reconfigurable Intelligent Surface Empowered Rate-Splitting Multiple Access for Simultaneous Wireless Information and Power Transfer

Authors: Chengzhong Tian, Yijie Mao, Kangchun Zhao, Yuanming Shi, Bruno Clerckx

Abstract: Rate-splitting multiple access (RSMA) and reconfigurable intelligent surface (RIS) have been both recognized as promising techniques for 6G. The benefits of combining the two techniques to enhance the spectral and energy efficiency have been recently exploited in communication-only networks. Inspired by the recent advances on RIS empowered RSMA, in this work we investigate the use of RIS empowered RSMA for simultaneous wireless information and power transfer (SWIPT) with one transmitter concurrently sending information to multiple information receivers (IRs) and transferring energy to multiple energy receivers (ERs). Specifically, we jointly optimize the transmit beamformers and the RIS reflection coefficients to maximize the weighted sum-rate (WSR) of IRs under the harvested energy constraint of ERs and the transmit power constraint. An alternating optimization and successive convex approximation (SCA)-based optimization framework is then proposed to solve the problem. Numerical results show that by marrying the benefits of RSMA and RIS, the proposed RIS empowered RSMA achieves a better tradeoff between the WSR of IRs and energy harvested at ERs. Therefore, we conclude that RIS empowered RSMA is a promising strategy for SWIPT.

Submitted 13 January, 2023; originally announced January 2023.

arXiv:2212.09988 [pdf, other]

Multi-Reference Image Super-Resolution: A Posterior Fusion Approach

Authors: Ke Zhao, Haining Tan, Tsz Fung Yau

Abstract: Reference-based Super-resolution (RefSR) approaches have recently been proposed to overcome the ill-posed problem of image super-resolution by providing additional information from a high-resolution image. Multi-reference super-resolution extends this approach by allowing more information to be incorporated. This paper proposes a 2-step-weighting posterior fusion approach to combine the outputs of RefSR models with multiple references. Extensive experiments on the CUFED5 dataset demonstrate that the proposed methods can be applied to various state-of-the-art RefSR models to get a consistent improvement in image quality.

Submitted 19 December, 2022; originally announced December 2022.

arXiv:2210.12715 [pdf, ps, other]

Adaptive Control with Global Exponential Stability for Parameter-Varying Nonlinear Systems under Unknown Control Gains

Authors: Hefu Ye, Haijia Wu, Kai Zhao, Yongduan Song

Abstract: It is nontrivial to achieve exponential stability even for time-invariant nonlinear systems with matched uncertainties and persistent excitation (PE) condition. In this paper, without the need for PE condition, we address the problem of global exponential stabilization of strict-feedback systems with mismatched uncertainties and unknown yet time-varying control gains. The resultant control, embedded with time-varying feedback gains, is capable of ensuring global exponential stability of parametric-strict-feedback systems in the absence of persistence of excitation. By using the enhanced Nussbaum function, the previous results are extended to more general nonlinear systems where the sign and magnitude of the time-varying control gain are unknown. In particular, the argument of the Nussbaum function is guaranteed to be always positive with the aid of nonlinear damping design, which is critical to perform a straightforward technical analysis of the boundedness of the Nussbaum function. Finally, the global exponential stability of parameter-varying strict-feedback systems, the boundedness of the control input and the update rate, and the asymptotic constancy of the parameter estimate are established. Numerical simulations are carried out to verify the effectiveness and benefits of the proposed methods.

Submitted 23 October, 2022; originally announced October 2022.

arXiv:2204.03238 [pdf, other]

Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis

Authors: Yutian Wang, Yuankun Xie, Kun Zhao, Hui Wang, Qin Zhang

Abstract: In this paper, we propose a novel prosody disentanglement method for prosodic Text-to-Speech (TTS) models, which introduces the vector quantization (VQ) method to the auxiliary prosody encoder to obtain the decomposed prosody representations in an unsupervised manner. Relying on its advantages, the speaking styles, such as pitch, speaking velocity, local pitch variance, etc., are decomposed automatically into the latent quantized vectors. We also investigate the internal mechanism of the VQ disentanglement process by means of a latent variables counter and find that higher value dimensions usually represent prosody information. Experiments show that our model can control the speaking styles of synthesis results by directly manipulating the latent variables. The objective and subjective evaluations illustrate that our model outperforms the popular models.

Submitted 7 April, 2022; originally announced April 2022.

Comments: accepted by IEEE International Conference on Multimedia and Expo 2022 (ICME2022)

arXiv:2203.14291 [pdf, other]

doi 10.1007/s11633-022-1371-y

Video Polyp Segmentation: A Deep Learning Perspective

Authors: Ge-Peng Ji, Guobao Xiao, Yu-Cheng Chou, Deng-Ping Fan, Kai Zhao, Geng Chen, Luc Van Gool

Abstract: We present the first comprehensive video polyp segmentation (VPS) study in the deep learning era. Over the years, developments in VPS are not moving forward with ease due to the lack of large-scale fine-grained segmentation annotations. To address this issue, we first introduce a high-quality frame-by-frame annotated VPS dataset, named SUN-SEG, which contains 158,690 colonoscopy frames from the well-known SUN-database. We provide additional annotations with diverse types, i.e., attribute, object mask, boundary, scribble, and polygon. Second, we design a simple but efficient baseline, dubbed PNS+, consisting of a global encoder, a local encoder, and normalized self-attention (NS) blocks. The global and local encoders receive an anchor frame and multiple successive frames to extract long-term and short-term spatial-temporal representations, which are then progressively updated by two NS blocks. Extensive experiments show that PNS+ achieves the best performance and real-time inference speed (170fps), making it a promising solution for the VPS task. Third, we extensively evaluate 13 representative polyp/object segmentation models on our SUN-SEG dataset and provide attribute-based comparisons. Finally, we discuss several open issues and suggest possible research directions for the VPS community.

Submitted 31 August, 2022; v1 submitted 27 March, 2022; originally announced March 2022.

Comments: Accepted by Machine Intelligence Research 2022 (Project Page: https://github.com/GewelsJI/VPS)

Journal ref: Machine Intelligence Research, vol. 19, no. 6, pp.531-549, 2022

arXiv:2112.09891 [pdf, other]

Equilibrated Zeroth-Order Unrolled Deep Networks for Accelerated MRI

Authors: Zhuo-Xu Cui, Jing Cheng, Qingyong Zhu, Yuanyuan Liu, Sen Jia, Kankan Zhao, Ziwen Ke, Wenqi Huang, Haifeng Wang, Yanjie Zhu, Dong Liang

Abstract: Recently, model-driven deep learning unrolls a certain iterative algorithm of a regularization model into a cascade network by replacing the first-order information (i.e., (sub)gradient or proximal operator) of the regularizer with a network module, which appears more explainable and predictable compared to common data-driven networks. Conversely, in theory, there is not necessarily such a functional regularizer whose first-order information matches the replaced network module, which means the network output may not be covered by the original regularization model. Moreover, up to now, there is also no theory to guarantee the global convergence and robustness (regularity) of unrolled networks under realistic assumptions. To bridge this gap, this paper presents a safeguarded methodology for network unrolling. Specifically, focusing on accelerated MRI, we unroll a zeroth-order algorithm, of which the network module represents the regularizer itself, so that the network output can still be covered by the regularization model. Furthermore, inspired by the idea of deep equilibrium models, before backpropagating, we run the unrolled iterative network until it converges to a fixed point to ensure convergence. In case the measurement data contains noise, we prove that the proposed network is robust against noisy interference. Finally, numerical experiments show that the proposed network consistently outperforms the state-of-the-art MRI reconstruction methods including traditional regularization methods and other deep learning methods.

Submitted 22 December, 2021; v1 submitted 18 December, 2021; originally announced December 2021.

Comments: 11 figures

arXiv:2103.13197 [pdf, other]

Topology Design for GNSSs Considering Both Inter-satellite Links and Ground-satellite Links

Authors: Z. Yan, K. Zhao, W. Li, C. Kang, J. Zheng, H. Yang, S. Du

Abstract: Inter-satellite links (ISLs) are adopted in global navigation satellite systems (GNSSs) for high-precision orbit determination and space-based end-to-end telemetry telecommand control and communications. Due to limited onboard ISL terminals, the polling time division duplex (PTDD) mechanism is usually proposed for space link layer networking. By extending the polling mechanism to ground-satellite links (GSLs), a unified management system of the space segment and the ground segment can be realized. However, how to jointly design the topology of ISLs and GSLs during every slot under the polling system to improve data interaction has not been studied. In this paper, we formulate the topology design problem as an integer linear program, aiming at minimizing the average delay of data delivery from satellites to ground stations while satisfying the ranging requirement for the orbit determination. To tackle the computational complexity problem, we first present a novel modeling method of delay to reduce the number of decision variables. Further, we propose a more efficient heuristic based on maximum weight matching algorithms. Simulation results demonstrate the feasibility of the proposed methods for practical operation in GNSSs. Comparing the two methods, the heuristic can achieve similar performance with respect to average delay but with significantly less complexity.

Submitted 24 March, 2021; originally announced March 2021.

arXiv:2103.09300 [pdf]

The impact of data volume on performance of deep learning based building rooftop extraction using very high spatial resolution aerial images

Authors: Hongjie He, Ke Yang, Yuwei Cai, Zijian Jiang, Qiutong Yu, Kun Zhao, Junbo Wang, Sarah Narges Fatholahi, Yan Liu, Hasti Andon Petrosians, Bingxu Hu, Liyuan Qing, Zhehan Zhang, Hongzhang Xu, Siyu Li, Kyle Gao, Linlin Xu, Jonathan Li

Abstract: Building rooftop data are of importance in several urban applications and in natural disaster management. In contrast to traditional surveying and mapping, by using high spatial resolution aerial images, deep learning-based building rooftop extraction methods are efficient and accurate. Although more training data is preferred in deep learning-based tasks, the effect of data volume on building extraction models is underexplored. Therefore, the paper explores the impact of data volume on the performance of building rooftop extraction from very-high-spatial-resolution (VHSR) images using deep learning-based methods. To do so, we manually labelled 0.12m spatial resolution aerial images and performed a comparative analysis of models trained on datasets of different sizes using popular deep learning architectures for segmentation tasks, including Fully Convolutional Networks (FCN)-8s, U-Net and DeepLabv3+. The experiments showed that with more training data, algorithms converged faster and achieved higher accuracy, while better algorithms were able to better mitigate the lack of training data.

Submitted 4 October, 2021; v1 submitted 16 March, 2021; originally announced March 2021.

arXiv:2009.09761 [pdf, other]

DiffWave: A Versatile Diffusion Model for Audio Synthesis

Authors: Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro

Abstract: In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.

Submitted 30 March, 2021; v1 submitted 21 September, 2020; originally announced September 2020.

Comments: ICLR 2021 (oral)

arXiv:2005.05898 [pdf, other]

doi 10.1109/ICASSP40776.2020.9053100

Learning to Estimate Driver Drowsiness from Car Acceleration Sensors using Weakly Labeled Data

Authors: Takayuki Katsuki, Kun Zhao, Takayuki Yoshizumi

Abstract: This paper addresses the learning task of estimating driver drowsiness from the signals of car acceleration sensors. Since even drivers themselves cannot perceive their own drowsiness in a timely manner unless they use burdensome invasive sensors, obtaining labeled training data for each timestamp is not a realistic goal. To deal with this difficulty, we formulate the task as a weakly supervised learning problem. We only need to add labels for each complete trip, not for every timestamp independently. By assuming that some aspects of driver drowsiness increase over time due to tiredness, we formulate an algorithm that can learn from such weakly labeled data. We derive a scalable stochastic optimization method as a way of implementing the algorithm. Numerical experiments on real driving datasets demonstrate the advantages of our algorithm against baseline methods.

Submitted 12 May, 2020; originally announced May 2020.

Comments: Accepted by ICASSP2020

arXiv:1912.01219 [pdf, other]

WaveFlow: A Compact Flow-based Model for Raw Audio

Authors: Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song

Abstract: In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. It handles the long-range structure of 1-D waveform with a dilated 2-D convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases. It generates high-fidelity speech as WaveNet, while synthesizing several orders of magnitude faster as it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps. Furthermore, it can significantly reduce the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has only 5.91M parameters, which is 15$\times$ smaller than WaveGlow. It can generate 22.05 kHz high-fidelity audio 42.6$\times$ faster than real-time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels.

Submitted 24 June, 2020; v1 submitted 3 December, 2019; originally announced December 2019.

Comments: Published at ICML 2020. Code and pre-trained models: https://github.com/PaddlePaddle/Parakeet

arXiv:1907.06844 [pdf, other]

Deep inspection: an electrical distribution pole parts study via deep neural networks

Authors: Liangchen Liu, Teng Zhang, Kun Zhao, Arnold Wiliem, Kieren Astin-Walmsley, Brian Lovell

Abstract: Electrical distribution poles are important assets in electricity supply. These poles need to be maintained in good condition to ensure they protect community safety, maintain reliability of supply, and meet legislative obligations. However, maintaining such a large volume of assets is an expensive and challenging task. To address this, recent approaches utilise imagery data captured from helicopter and/or drone inspections. Whilst this reduces the cost of manual inspection, manual analysis of each image is still required. As such, several image-based automated inspection systems have been proposed. In this paper, we target two major challenges: tiny object detection and extremely imbalanced datasets, which currently hinder the wide deployment of automatic inspection. We propose a novel two-stage zoom-in detection method to gradually focus on the object of interest. To address the imbalanced dataset problem, we propose resampling and reweighting schemes to iteratively adapt the model to the large intra-class variation of the major class and balance the contributions to the loss from each class. Finally, we integrate these components together and devise a novel automatic inspection framework. Extensive experiments demonstrate that our proposed approaches are effective and can boost the performance compared to the baseline methods.

Submitted 16 July, 2019; originally announced July 2019.

Comments: electrical distribution pole inspection, integrated inspection system, object detection, imbalanced data classification, To appear in Proceeding of ICIP 2019

arXiv:1907.04462 [pdf, other]

Multi-Speaker End-to-End Speech Synthesis

Authors: Jihyun Park, Kexin Zhao, Kainan Peng, Wei Ping

Abstract: In this work, we extend ClariNet (Ping et al., 2019), a fully end-to-end speech synthesis model (i.e., text-to-wave), to generate high-fidelity speech from multiple speakers. To model the unique characteristic of different voices, low dimensional trainable speaker embeddings are shared across each component of ClariNet and trained together with the rest of the model. We demonstrate that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.

Submitted 9 July, 2019; originally announced July 2019.

arXiv:1905.08459 [pdf, other]

Non-Autoregressive Neural Text-to-Speech

Authors: Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao

Abstract: In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and brings 46.7 times speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on the challenging test sentences by iteratively improving the attention in a layer-by-layer manner. Furthermore, we build the parallel text-to-speech system and test various parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet as in previous work.

Submitted 29 June, 2020; v1 submitted 21 May, 2019; originally announced May 2019.

Comments: Published at ICML 2020. (v3 changed paper title)

arXiv:1806.09068 [pdf]

Prototype of Front-end Electronics for PandaX-4ton Experiment

Authors: Shuwen Wang, Zhongtao Shen, Keqing Zhao, Changqing Feng, Shubin Liu

Abstract: At the China Jinping Underground Laboratory, the planned Particle AND Astrophysical Xenon phase IV (PandaX-4ton) is a dark matter direct detection experiment with a dual-phase xenon detector, an upgrade of the second phase of the experiment, PandaX-II. In this paper, the prototype of the front-end electronics of PandaX-4ton is presented. The front-end electronics consist of the high-gain preamplifier cards and the eight-channel digitizers with 14-bit resolution and 1 GSps sampling rate for waveform digitization. The clock synchronization circuit within the digitizer is well-designed to align all the PMT channels. The digitizer also contains gigabit fiber to exchange data with the trigger and data acquisition system. The effective number of bits of the digitizer is about 9.7 b at 148 MHz, the integral nonlinearity of the digitizer ranges from -4 least significant bit (LSB) to +4 LSB, and the differential nonlinearity ranges from -0.6 LSB to +0.6 LSB. The performance of the front-end electronics can meet the requirements for the PandaX-4ton.

Submitted 23 June, 2018; originally announced June 2018.

arXiv:1710.09026 [pdf, other]

Trace norm regularization and faster inference for embedded speech recognition RNNs

Authors: Markus Kliegl, Siddharth Goyal, Kexin Zhao, Kavya Srinet, Mohammad Shoeybi

Abstract: We propose and evaluate new techniques for compressing and speeding up dense matrix multiplications as found in the fully connected and recurrent layers of neural networks for embedded large vocabulary continuous speech recognition (LVCSR). For compression, we introduce and study a trace norm regularization technique for training low rank factored versions of matrix multiplications. Compared to standard low rank training, we show that our method leads to good accuracy versus number of parameter trade-offs and can be used to speed up training of large models. For speedup, we enable faster inference on ARM processors through new open sourced kernels optimized for small batch sizes, resulting in 3x to 7x speed ups over the widely used gemmlowp library. Beyond LVCSR, we expect our techniques and kernels to be more generally applicable to embedded neural networks with large fully connected or recurrent layers.

Submitted 6 February, 2018; v1 submitted 24 October, 2017; originally announced October 2017.

Comments: Our optimized inference kernels are available at: https://github.com/PaddlePaddle/farm (Note: This paper was submitted to, but rejected from, ICLR 2018. We believe it may still be of value to others. Please see the discussion here: https://openreview.net/forum?id=B1tC-LT6W)

Showing 1–30 of 30 results for author: Zhao, K