Search | arXiv e-print repository

arXiv:2406.07498 [pdf, other]

RaD-Net 2: A causal two-stage repairing and denoising speech enhancement network with knowledge distillation and complex axial self-attention

Authors: Mingshuai Liu, Zhuangqi Chen, Xiaopeng Yan, Yuanjun Lv, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie

Abstract: In real-time speech communication systems, speech signals are often degraded by multiple distortions. Recently, a two-stage Repair-and-Denoising network (RaD-Net) was proposed with superior speech quality improvement in the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. However, the failure to use future information and the constrained receptive field of convolution layers limit the system's performance. To mitigate these problems, we extend RaD-Net to its upgraded version, RaD-Net 2. Specifically, a causality-based knowledge distillation is introduced in the first stage to use future information in a causal way. We use the non-causal repairing network as the teacher to improve the performance of the causal repairing network. In addition, in the second stage, complex axial self-attention is applied in the denoising network's complex feature encoder/decoder. Experimental results on the ICASSP 2024 SSI Challenge blind test set show that RaD-Net 2 brings 0.10 OVRL DNSMOS improvement compared to RaD-Net.

Submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2405.20336 [pdf, other]

RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text

Authors: Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, Chuang Gan

Abstract: In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information, and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation. The project page is available for research purposes at https://vis-www.cs.umass.edu/RapVerse.

Submitted 30 May, 2024; originally announced May 2024.

Comments: Project website: https://vis-www.cs.umass.edu/RapVerse

arXiv:2404.17125 [pdf, other]

doi 10.1109/ICPSAsia48933.2020.9208421

Misaka: Interactive Swarm Testbed for Smart Grid Distributed Algorithm Test and Evaluation

Authors: Tingliang Zhang, Haiwang Zhong, Zhenfei Tan, Xinfei Yan

Abstract: In this paper, we present Misaka, a visualized swarm testbed for smart grid algorithm evaluation that also serves as an extendable open-source, open-hardware platform for developing tabletop tangible swarm interfaces. The platform consists of a collection of custom-designed robots with three omni-directional wheels, each 10 cm in diameter, high-accuracy localization through a microdot pattern overlaid on top of the activity sheets, and a software framework for application development and control, while remaining affordable (per-unit cost of about 30 USD at the prototype stage). We illustrate the potential of tabletop swarm user interfaces through a set of smart grid algorithm application scenarios developed with Misaka.

Submitted 25 April, 2024; originally announced April 2024.

Journal ref: 2020 IEEE/IAS Industrial and Commercial Power System Asia (I&CPS Asia)

arXiv:2404.16476 [pdf, ps, other]

A Novel Channel Coding Scheme for Digital Multiple Access Computing

Authors: Xiaojing Yan, Saeed Razavikia, Carlo Fischione

Abstract: In this paper, we consider the ChannelComp framework, which facilitates the computation of desired functions by multiple transmitters over a common receiver using digital modulations across a multiple access channel. While ChannelComp currently offers a broad framework for computation by designing digital constellations for over-the-air computation and employing symbol-level encoding, encoding the repeated transmissions of the same symbol and using the corresponding received sequence may significantly improve the computation performance and reduce the encoding complexity. In this paper, we propose an enhancement involving the encoding of the repetitive transmission of the same symbol at each transmitter over multiple time slots and the design of constellation diagrams, with the aim of minimizing computational errors. We frame this enhancement as an optimization problem, which jointly identifies the constellation diagram and the channel code for repetition, which we call ReChCompCode. To manage the computational complexity of the optimization, we divide it into two tractable subproblems. Through numerical experiments, we evaluate the performance of ReChCompCode. The simulation results reveal that ReChCompCode can reduce the computation error by approximately up to 30 dB compared to standard ChannelComp, particularly for product functions.

Submitted 25 April, 2024; originally announced April 2024.

Comments: accepted version to the IEEE 2024 ICC conference

arXiv:2404.10343 [pdf, other]

The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang , et al. (109 additional authors not shown)

Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (i.e., runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum-up of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.

Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

arXiv:2404.09226 [pdf, other]

Breast Cancer Image Classification Method Based on Deep Transfer Learning

Authors: Weimin Wang, Min Gao, Mingxuan Xiao, Xu Yan, Yufeng Li

Abstract: To address the issues of limited samples, time-consuming feature design, and low accuracy in detection and classification of breast cancer pathological images, a breast cancer image classification model algorithm combining deep learning and transfer learning is proposed. This algorithm is based on the DenseNet structure of deep neural networks, and constructs a network model by introducing attention mechanisms, and trains the enhanced dataset using multi-level transfer learning. Experimental results demonstrate that the algorithm achieves an efficiency of over 84.0% in the test set, with a significantly improved classification accuracy compared to previous models, making it applicable to medical breast cancer detection tasks.

Submitted 14 April, 2024; originally announced April 2024.

arXiv:2404.08713 [pdf, other]

Survival Prediction Across Diverse Cancer Types Using Neural Networks

Authors: Xu Yan, Weimin Wang, MingXuan Xiao, Yufeng Li, Min Gao

Abstract: Gastric cancer and colon adenocarcinoma represent widespread and challenging malignancies with high mortality rates and complex treatment landscapes. In response to the critical need for accurate prognosis in cancer patients, the medical community has embraced the 5-year survival rate as a vital metric for estimating patient outcomes. This study introduces a pioneering approach to enhance survival prediction models for gastric and colon adenocarcinoma patients. Leveraging advanced image analysis techniques, we sliced whole slide images (WSI) of these cancers, extracting comprehensive features to capture nuanced tumor characteristics. Subsequently, we constructed patient-level graphs, encapsulating intricate spatial relationships within tumor tissues. These graphs served as inputs for a sophisticated 4-layer graph convolutional neural network (GCN), designed to exploit the inherent connectivity of the data for comprehensive analysis and prediction. By integrating patients' total survival time and survival status, we computed C-index values for gastric cancer and colon adenocarcinoma, yielding 0.57 and 0.64, respectively. Significantly surpassing previous convolutional neural network models, these results underscore the efficacy of our approach in accurately predicting patient survival outcomes. This research holds profound implications for both the medical and AI communities, offering insights into cancer biology and progression while advancing personalized treatment strategies. Ultimately, our study represents a significant stride in leveraging AI-driven methodologies to revolutionize cancer prognosis and improve patient outcomes on a global scale.

Submitted 11 April, 2024; originally announced April 2024.
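The abstract above reports C-index values computed from survival time and survival status. As a rough illustration of what that metric measures, here is a minimal sketch assuming Harrell's formulation of the concordance index. The abstract does not say which variant the authors used, and the function name, arguments, and tie handling below are illustrative assumptions rather than the paper's code.

```python
def concordance_index(times, events, risks):
    """Harrell-style C-index (assumed variant, not confirmed by the abstract).

    times  : observed survival or censoring times
    events : 1 if death was observed, 0 if the patient was censored
    risks  : predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only usable if the earlier time is an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier death got the higher risk: concordant
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four patients, the third one censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
perfect = concordance_index(toy_times, toy_events, [0.9, 0.7, 0.4, 0.1])
reversed_order = concordance_index(toy_times, toy_events, [0.1, 0.4, 0.7, 0.9])
uninformative = concordance_index(toy_times, toy_events, [0.5, 0.5, 0.5, 0.5])
```

Under this definition a value of 0.5 corresponds to uninformative ranking and 1.0 to perfect ranking, which gives context for the reported 0.57 and 0.64.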

arXiv:2404.08279 [pdf, other]

Convolutional neural network classification of cancer cytopathology images: taking breast cancer as an example

Authors: MingXuan Xiao, Yufeng Li, Xu Yan, Min Gao, Weimin Wang

Abstract: Breast cancer is a relatively common cancer among gynecological cancers. Its diagnosis often relies on the pathology of cells in the lesion. The pathological diagnosis of breast cancer not only requires professionals and time, but also sometimes involves subjective judgment. To address the challenges of dependence on pathologists' expertise and the time-consuming nature of achieving accurate breast pathological image classification, this paper introduces an approach utilizing convolutional neural networks (CNNs) for the rapid categorization of pathological images, aiming to enhance the efficiency of breast pathological image detection. The approach enables the rapid and automatic classification of pathological images into benign and malignant groups. The methodology involves utilizing a convolutional neural network (CNN) model leveraging the Inceptionv3 architecture and transfer learning algorithm for extracting features from pathological images. A neural network with fully connected layers and the SoftMax function is then used for image classification. Additionally, the concept of image partitioning is introduced to handle high-resolution images. To achieve the ultimate classification outcome, the classification probabilities of each image block are aggregated using three algorithms: summation, product, and maximum. Experimental validation was conducted on the BreaKHis public dataset, resulting in accuracy rates surpassing 0.92 across all four magnification coefficients (40X, 100X, 200X, and 400X). It demonstrates that the proposed method effectively enhances the accuracy in classifying pathological images of breast cancer.

Submitted 12 April, 2024; originally announced April 2024.
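The abstract above aggregates per-block classification probabilities with three rules: summation, product, and maximum. The following is a minimal sketch of how such rules could be applied, assuming each image block yields a probability vector over the classes. The function name, the class ordering (0 = benign, 1 = malignant), and the toy numbers are hypothetical and not taken from the paper.

```python
import math


def aggregate_block_probabilities(block_probs, rule="sum"):
    """Fuse per-block class probabilities into one image-level decision.

    block_probs : list of per-block probability vectors, one entry per class
    rule        : "sum", "product", or "max" (the three rules named in the abstract)
    Returns (predicted_class_index, per_class_scores).
    """
    n_classes = len(block_probs[0])
    # Regroup so that each inner list holds one class's probability across all blocks.
    per_class = [[block[c] for block in block_probs] for c in range(n_classes)]
    if rule == "sum":
        scores = [sum(values) for values in per_class]
    elif rule == "product":
        scores = [math.prod(values) for values in per_class]
    elif rule == "max":
        scores = [max(values) for values in per_class]
    else:
        raise ValueError(f"unknown rule: {rule}")
    predicted = max(range(n_classes), key=scores.__getitem__)
    return predicted, scores


# Hypothetical image split into three blocks; assumed class order is [benign, malignant].
blocks = [[0.8, 0.2], [0.8, 0.2], [0.05, 0.95]]
label_sum, _ = aggregate_block_probabilities(blocks, "sum")
label_product, _ = aggregate_block_probabilities(blocks, "product")
label_max, _ = aggregate_block_probabilities(blocks, "max")
```

On this toy input the summation rule follows the two moderately confident blocks, while the product and maximum rules are swayed by the single highly confident block, which suggests why comparing all three rules can be worthwhile.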

arXiv:2404.05217 [pdf, other]

Network-Constrained Unit Commitment with Flexible Temporal Resolution

Authors: Zekuan Yu, Haiwang Zhong, Guangchun Ruan, Xinfei Yan

Abstract: Modern network-constrained unit commitment (NCUC) bears a heavy computational burden due to the ever-growing model scale. This situation becomes more challenging when detailed operational characteristics, complicated constraints, and multiple objectives are considered. We propose a novel simplification method to determine the flexible temporal resolution for acceleration and near-optimal solutions. The flexible temporal resolution is determined by analyzing the impact on generators in each adaptive time period with awareness of congestion effects. Additionally, multiple improvements are employed on the existing NCUC model compatible with flexible temporal resolution to reduce the number of integer variables while preserving the original features. A case study using the IEEE 118-bus and the Polish 2736-bus systems verifies that the proposed method achieves substantial acceleration with low cost variation and high accuracy.

Submitted 8 April, 2024; originally announced April 2024.

Comments: 11 pages, 10 figures. Accepted by IEEE Transactions on Power Systems

arXiv:2404.05149 [pdf, other]

Intelligent Reflecting Surface Aided Target Localization With Unknown Transceiver-IRS Channel State Information

Authors: Taotao Ji, Meng Hua, Xuanhong Yan, Chunguo Li, Yongming Huang, Luxi Yang

Abstract: Integrating wireless sensing capabilities into base stations (BSs) has become a widespread trend in the future beyond fifth-generation (B5G)/sixth-generation (6G) wireless networks. In this paper, we investigate intelligent reflecting surface (IRS) enabled wireless localization, in which an IRS is deployed to assist a BS in locating a target in its non-line-of-sight (NLoS) region. In particular, we consider the case where the BS-IRS channel state information (CSI) is unknown. Specifically, we first propose a separate BS-IRS channel estimation scheme in which the BS operates in full-duplex mode (FDM), i.e., a portion of the BS antennas send downlink pilot signals to the IRS, while the remaining BS antennas receive the uplink pilot signals reflected by the IRS. However, we can only obtain an incomplete BS-IRS channel matrix based on our developed iterative coordinate descent-based channel estimation algorithm due to the "sign ambiguity issue". Then, we employ the multiple hypotheses testing framework to perform target localization based on the incomplete estimated channel, in which the probability of each hypothesis is updated using Bayesian inference at each cycle. Moreover, we formulate a joint BS transmit waveform and IRS phase shifts optimization problem to improve the target localization performance by maximizing the weighted sum distance between each two hypotheses. However, the objective function is essentially a quartic function of the IRS phase shift vector, thus motivating us to resort to the penalty-based method to tackle this challenge. Simulation results validate the effectiveness of our proposed target localization scheme and show that the scheme's performance can be further improved by finely designing the BS transmit waveform and IRS phase shifts intending to maximize the weighted sum distance between different hypotheses.

Submitted 7 April, 2024; originally announced April 2024.
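The abstract above states that the probability of each localization hypothesis is updated using Bayesian inference at each cycle. The following is a generic sketch of such a recursive update, assuming a likelihood value is available for each hypothesis per cycle. The abstract does not describe how the likelihoods are computed, so the function name and the numbers are hypothetical placeholders.

```python
def update_hypothesis_probabilities(prior, likelihoods):
    """One Bayesian update cycle over a finite set of hypotheses.

    prior       : current probability of each hypothesis (sums to 1)
    likelihoods : likelihood of the newly received signal under each hypothesis
    Returns the normalized posterior, which serves as the prior for the next cycle.
    """
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    if total <= 0:
        raise ValueError("all hypotheses have zero posterior mass")
    return [u / total for u in unnormalized]


# Hypothetical example: four candidate target locations, uniform initial belief.
belief = [0.25, 0.25, 0.25, 0.25]
for cycle_likelihoods in ([0.1, 0.2, 0.6, 0.1], [0.1, 0.2, 0.6, 0.1]):
    belief = update_hypothesis_probabilities(belief, cycle_likelihoods)
most_likely = max(range(len(belief)), key=belief.__getitem__)
```

After repeated cycles the belief concentrates on the hypothesis that best explains the observations, which is the behaviour the multiple hypotheses testing framework relies on.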

arXiv:2402.15738 [pdf, other]

Privacy-Preserving State Estimation in the Presence of Eavesdroppers: A Survey

Authors: Xinhao Yan, Guanzhong Zhou, Daniel E. Quevedo, Carlos Murguia, Bo Chen, Hailong Huang

Abstract: Networked systems are increasingly the target of cyberattacks that exploit vulnerabilities within digital communications, embedded hardware, and software. Arguably, the simplest class of attacks -- and often the first type before launching destructive integrity attacks -- are eavesdropping attacks, which aim to infer information by collecting system data and exploiting it for malicious purposes. A key technology of networked systems is state estimation, which leverages sensing and actuation data and first-principles models to enable trajectory planning, real-time monitoring, and control. However, state estimation can also be exploited by eavesdroppers to identify models and reconstruct states with the aim of, e.g., launching integrity (stealthy) attacks and inferring sensitive information. It is therefore crucial to protect disclosed system data to avoid an accurate state estimation by eavesdroppers. This survey presents a comprehensive review of existing literature on privacy-preserving state estimation methods, while also identifying potential limitations and research gaps. Our primary focus revolves around three types of methods: cryptography, data perturbation, and transmission scheduling, with particular emphasis on Kalman-like filters. Within these categories, we delve into the concepts of homomorphic encryption and differential privacy, which have been extensively investigated in recent years in the context of privacy-preserving state estimation. Finally, we shed light on several technical and fundamental challenges surrounding current methods and propose potential directions for future research.

Submitted 24 February, 2024; originally announced February 2024.

Comments: 16 pages, 5 figures, 4 tables

arXiv:2402.09372 [pdf, other]

Deep Rib Fracture Instance Segmentation and Classification from CT on the RibFrac Challenge

Authors: Jiancheng Yang, Rui Shi, Liang Jin, Xiaoyang Huang, Kaiming Kuang, Donglai Wei, Shixuan Gu, Jianying Liu, Pengfei Liu, Zhizhong Chai, Yongjie Xiao, Hao Chen, Liming Xu, Bang Du, Xiangyi Yan, Hao Tang, Adam Alessio, Gregory Holste, Jiapeng Zhang, Xiaoming Wang, Jianye He, Lixuan Che, Hanspeter Pfister, Ming Li, Bingbing Ni

Abstract: Rib fractures are a common and potentially severe injury that can be challenging and labor-intensive to detect in CT scans. While there have been efforts to address this field, the lack of large-scale annotated datasets and evaluation benchmarks has hindered the development and validation of deep learning algorithms. To address this issue, the RibFrac Challenge was introduced, providing a benchmark dataset of over 5,000 rib fractures from 660 CT scans, with voxel-level instance mask annotations and diagnosis labels for four clinical categories (buckle, nondisplaced, displaced, or segmental). The challenge includes two tracks: a detection (instance segmentation) track evaluated by an FROC-style metric and a classification track evaluated by an F1-style metric. During the MICCAI 2020 challenge period, 243 results were evaluated, and seven teams were invited to participate in the challenge summary. The analysis revealed that several top rib fracture detection solutions achieved performance comparable to or even better than human experts. Nevertheless, the current rib fracture classification solutions are hardly clinically applicable, which can be an interesting area in the future. As an active benchmark and research resource, the data and online evaluation of the RibFrac Challenge are available at the challenge website. As an independent contribution, we have also extended our previous internal baseline by incorporating recent advancements in large-scale pretrained networks and point-based rib segmentation techniques. The resulting FracNet+ demonstrates competitive performance in rib fracture detection, which lays a foundation for further research and development in AI-assisted rib fracture detection and diagnosis.

Submitted 14 February, 2024; originally announced February 2024.

Comments: Challenge paper for MICCAI RibFrac Challenge (https://ribfrac.grand-challenge.org/)

arXiv:2402.09101 [pdf, other]

DestripeCycleGAN: Stripe Simulation CycleGAN for Unsupervised Infrared Image Destriping

Authors: Shiqi Yang, Hanlin Qin, Shuai Yuan, Xiang Yan, Hossein Rahmani

Abstract: CycleGAN has been proven to be an advanced approach for unsupervised image restoration. This framework consists of two generators: a denoising one for inference and an auxiliary one for modeling noise to fulfill cycle-consistency constraints. However, when applied to the infrared destriping task, it becomes challenging for the vanilla auxiliary generator to consistently produce vertical noise under unsupervised constraints. This poses a threat to the effectiveness of the cycle-consistency loss, leading to stripe noise residual in the denoised image. To address the above issue, we present a novel framework for single-frame infrared image destriping, named DestripeCycleGAN. In this model, the conventional auxiliary generator is replaced with a priori stripe generation model (SGM) to introduce vertical stripe noise in the clean data, and the gradient map is employed to re-establish cycle-consistency. Meanwhile, a Haar wavelet background guidance module (HBGM) has been designed to minimize the divergence of background details between the different domains. To preserve vertical edges, a multi-level wavelet U-Net (MWUNet) is proposed as the denoising generator, which utilizes the Haar wavelet transform as the sampler to reduce directional information loss. Moreover, it incorporates the group fusion block (GFB) into skip connections to fuse the multi-scale features and build the context of long-distance dependencies. Extensive experiments on real and synthetic data demonstrate that our DestripeCycleGAN surpasses the state-of-the-art methods in terms of visual quality and quantitative evaluation. Our code will be made public at https://github.com/0wuji/DestripeCycleGAN.

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2402.00028 [pdf, other]

Neural Rendering and Its Hardware Acceleration: A Review

Authors: Xinkai Yan, Jieting Xu, Yuchi Huo, Hujun Bao

Abstract: Neural rendering is a new image and video generation method based on deep learning. It combines the deep learning model with the physical knowledge of computer graphics, to obtain a controllable and realistic scene model, and realize the control of scene attributes such as lighting, camera parameters, posture and so on. On the one hand, neural rendering can not only make full use of the advantages of deep learning to accelerate the traditional forward rendering process, but also provide new solutions for specific tasks such as inverse rendering and 3D reconstruction. On the other hand, the design of innovative hardware structures that adapt to the neural rendering pipeline breaks through the parallel computing and power consumption bottleneck of existing graphics processors, which is expected to provide important support for future key areas such as virtual and augmented reality, film and television creation and digital entertainment, artificial intelligence and the metaverse. In this paper, we review the technical connotation, main challenges, and research progress of neural rendering. On this basis, we analyze the common requirements of neural rendering pipeline for hardware acceleration and the characteristics of the current hardware acceleration architecture, and then discuss the design challenges of neural rendering processor architecture. Finally, the future development trend of neural rendering processor architecture is prospected.

Submitted 6 January, 2024; originally announced February 2024.

arXiv:2401.04389 [pdf, other]

RaD-Net: A Repairing and Denoising Network for Speech Signal Improvement

Authors: Mingshuai Liu, Zhuangqi Chen, Xiaopeng Yan, Yuanjun Lv, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie

Abstract: This paper introduces our repairing and denoising network (RaD-Net) for the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. We extend our previous framework based on a two-stage network and propose an upgraded model. Specifically, we replace the repairing network with COM-Net from TEA-PSE. In addition, multi-resolution discriminators and multi-band discriminators are adopted in the training stage. Finally, we use a three-step training strategy to optimize our model. We submit two models with different sets of parameters to meet the RTF requirement of the two tracks. According to the official results, the proposed systems rank 2nd in track 1 and 3rd in track 2.

Submitted 9 January, 2024; originally announced January 2024.

Comments: submitted to ICASSP 2024

arXiv:2401.03697 [pdf, other]

An audio-quality-based multi-strategy approach for target speaker extraction in the MISP 2023 Challenge

Authors: Runduo Han, Xiaopeng Yan, Weiming Xu, Pengcheng Guo, Jiayao Sun, He Wang, Quan Lu, Ning Jiang, Lei Xie

Abstract: This paper describes our audio-quality-based multi-strategy approach for the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. Specifically, our approach adopts different extraction strategies based on the audio quality, striking a balance between interference removal and speech preservation, which benefits the back-end automatic speech recognition (ASR) systems. Experiments show that our approach achieves a character error rate (CER) of 24.2% and 33.2% on the Dev and Eval set, respectively, obtaining the second place in the challenge.

Submitted 6 March, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

Comments: Accepted by ICASSP 2024

arXiv:2312.13523 [pdf]

doi 10.1002/mrm.29990

High-resolution myelin-water fraction and quantitative relaxation mapping using 3D ViSTa-MR fingerprinting

Authors: Congyu Liao, Xiaozhi Cao, Siddharth Srinivasan Iyer, Sophie Schauman, Zihan Zhou, Xiaoqian Yan, Quan Chen, Zhitao Li, Nan Wang, Ting Gong, Zhe Wu, Hongjian He, Jianhui Zhong, Yang Yang, Adam Kerr, Kalanit Grill-Spector, Kawin Setsompop

Abstract: Purpose: This study aims to develop a high-resolution whole-brain multi-parametric quantitative MRI approach for simultaneous mapping of myelin-water fraction (MWF), T1, T2, and proton-density (PD), all within a clinically feasible scan time. Methods: We developed 3D ViSTa-MRF, which combined Visualization of Short Transverse relaxation time component (ViSTa) technique with MR Fingerprinting (MRF), to achieve high-fidelity whole-brain MWF and T1/T2/PD mapping on a clinical 3T scanner. To achieve fast acquisition and memory-efficient reconstruction, the ViSTa-MRF sequence leverages an optimized 3D tiny-golden-angle-shuffling spiral-projection acquisition and joint spatial-temporal subspace reconstruction with optimized preconditioning algorithm. With the proposed ViSTa-MRF approach, high-fidelity direct MWF mapping was achieved without a need for multi-compartment fitting that could introduce bias and/or noise from additional assumptions or priors. Results: The in-vivo results demonstrate the effectiveness of the proposed acquisition and reconstruction framework to provide fast multi-parametric mapping with high SNR and good quality. The in-vivo results of 1mm- and 0.66mm-iso datasets indicate that the MWF values measured by the proposed method are consistent with standard ViSTa results that are 30x slower with lower SNR. Furthermore, we applied the proposed method to enable 5-minute whole-brain 1mm-iso assessment of MWF and T1/T2/PD mappings for infant brain development and for post-mortem brain samples.
Conclusions: In this work, we have developed a 3D ViSTa-MRF technique that enables the acquisition of whole-brain MWF, quantitative T1, T2, and PD maps at 1mm and 0.66mm isotropic resolution in 5 and 15 minutes, respectively. This advancement allows for quantitative investigations of myelination changes in the brain.

Submitted 20 December, 2023; originally announced December 2023.

Comments: 38 pages, 12 figures and 1 table

Journal ref: Magnetic Resonance in Medicine 2023

arXiv:2311.04383 [pdf, other]

Active Collision Avoidance System for E-Scooters in Pedestrian Environment

Authors: Xuke Yan, Dan Shen

Abstract: In the dense fabric of urban areas, electric scooters have rapidly become a preferred mode of transportation. As they cater to modern mobility demands, they present significant safety challenges, especially when interacting with pedestrians. In general, e-scooters are suggested to be ridden in bike lanes/sidewalks or share the road with cars at the maximum speed of about 15-20 mph, which is more flexible and much faster than pedestrians and bicyclists. Accurate prediction of pedestrian movement, coupled with assistant motion control of scooters, is essential in minimizing collision risks and seamlessly integrating scooters in areas dense with pedestrians. Addressing these safety concerns, our research introduces a novel e-Scooter collision avoidance system (eCAS) with a method for predicting pedestrian trajectories, employing an advanced LSTM network integrated with a state refinement module. This proactive model is designed to ensure unobstructed movement in areas with substantial pedestrian traffic without collisions. Results are validated on two public datasets, ETH and UCY, providing encouraging outcomes. Our model demonstrated proficiency in anticipating pedestrian paths and augmented scooter path planning, allowing for heightened adaptability in densely populated locales. This study shows the potential of melding pedestrian trajectory prediction with scooter motion planning. With the ubiquity of electric scooters in urban environments, such advancements have become crucial to safeguard all participants in urban transit.

Submitted 7 November, 2023; originally announced November 2023.

Comments: Submitted to SAE 2024

arXiv:2310.04715 [pdf, other]

An Exploration of Task-decoupling on Two-stage Neural Post Filter for Real-time Personalized Acoustic Echo Cancellation

Authors: Zihan Zhang, Jiayao Sun, Xianjun Xia, Ziqian Wang, Xiaopeng Yan, Yijian Xiao, Lei Xie

Abstract: Deep learning based techniques have been popularly adopted in acoustic echo cancellation (AEC). Utilization of speaker representation has extended the frontier of AEC, thus attracting many researchers' interest in personalized acoustic echo cancellation (PAEC). Meanwhile, task-decoupling strategies are widely adopted in speech enhancement. To further explore the task-decoupling approach, we propose to use a two-stage task-decoupling post-filter (TDPF) in PAEC. Furthermore, a multi-scale local-global speaker representation is applied to improve speaker extraction in PAEC. Experimental results indicate that the task-decoupling model can yield better performance than a single joint network. The optimal approach is to decouple the echo cancellation from noise and interference speech suppression. Based on the task-decoupling sequence, optimal training strategies for the two-stage model are explored afterwards.

Submitted 7 October, 2023; originally announced October 2023.

Comments: accepted to ASRU 2023

arXiv:2309.06780 [pdf, other]

Distinguishing Neural Speech Synthesis Models Through Fingerprints in Speech Waveforms

Authors: Chu Yuan Zhang, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Xinrui Yan

Abstract: Recent strides in neural speech synthesis technologies, while enjoying widespread applications, have nonetheless introduced a series of challenges, spurring interest in the defence against the threat of misuse and abuse. Notably, source attribution of synthesized speech has value in forensics and intellectual property protection, but prior work in this area has certain limitations in scope. To address the gaps, we present our findings concerning the identification of the sources of synthesized speech in this paper. We investigate the existence of speech synthesis model fingerprints in the generated speech waveforms, with a focus on the acoustic model and the vocoder, and study the influence of each component on the fingerprint in the overall speech waveforms. Our research, conducted using the multi-speaker LibriTTS dataset, demonstrates two key insights: (1) vocoders and acoustic models impart distinct, model-specific fingerprints on the waveforms they generate, and (2) vocoder fingerprints are the more dominant of the two, and may mask the fingerprints from the acoustic model. These findings strongly suggest the existence of model-specific fingerprints for both the acoustic model and the vocoder, highlighting their potential utility in source identification applications.

Submitted 15 June, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: Accepted by CCL 2024

arXiv:2308.02776 [pdf, other]

Dual Degradation-Inspired Deep Unfolding Network for Low-Light Image Enhancement

Authors: Huake Wang, Xingsong Hou, Xiaoyang Yan

Abstract: Although low-light image enhancement has achieved great stride based on deep enhancement models, most of them mainly stress on enhancement performance via an elaborated black-box network and rarely explore the physical significance of enhancement models. Towards this issue, we propose a Dual degrAdation-inSpired deep Unfolding network, termed DASUNet, for low-light image enhancement. Specifically, we construct a dual degradation model (DDM) to explicitly simulate the deterioration mechanism of low-light images. It learns two distinct image priors via considering degradation specificity between luminance and chrominance spaces. To make the proposed scheme tractable, we design an alternating optimization solution to solve the proposed DDM. Further, the designed solution is unfolded into a specified deep network, imitating the iteration updating rules, to form DASUNet. Local and long-range information are obtained by prior modeling module (PMM), inheriting the advantages of convolution and Transformer, to enhance the representation capability of dual degradation priors. Additionally, a space aggregation module (SAM) is presented to boost the interaction of two degradation models. Extensive experiments on multiple popular low-light image datasets validate the effectiveness of DASUNet compared to canonical state-of-the-art low-light image enhancement methods. Our source code and pretrained model will be publicly available.

Submitted 4 August, 2023; originally announced August 2023.

Comments: 12 pages, 13 figures

arXiv:2307.09728 [pdf, other]

Uncertainty-Driven Multi-Scale Feature Fusion Network for Real-time Image Deraining

Authors: Ming Tong, Xuefeng Yan, Yongzhen Wang

Abstract: Visual-based measurement systems are frequently affected by rainy weather due to the degradation caused by rain streaks in captured images, and existing imaging devices struggle to address this issue in real-time. While most efforts leverage deep networks for image deraining and have made progress, their large parameter sizes hinder deployment on resource-constrained devices. Additionally, these data-driven models often produce deterministic results, without considering their inherent epistemic uncertainty, which can lead to undesired reconstruction errors. Well-calibrated uncertainty can help alleviate prediction errors and assist measurement devices in mitigating risks and improving usability. Therefore, we propose an Uncertainty-Driven Multi-Scale Feature Fusion Network (UMFFNet) that learns the probability mapping distribution between paired images to estimate uncertainty. Specifically, we introduce an uncertainty feature fusion block (UFFB) that utilizes uncertainty information to dynamically enhance acquired features and focus on blurry regions obscured by rain streaks, reducing prediction errors. In addition, to further boost the performance of UMFFNet, we fused feature information from multiple scales to guide the network for efficient collaborative rain removal. Extensive experiments demonstrate that UMFFNet achieves significant performance improvements with few parameters, surpassing state-of-the-art image deraining methods.

Submitted 18 July, 2023; originally announced July 2023.

arXiv:2305.13774 [pdf, other]

ADD 2023: the Second Audio Deepfake Detection Challenge

Authors: Jiangyan Yi, Jianhua Tao, Ruibo Fu, Xinrui Yan, Chenglong Wang, Tao Wang, Chu Yuan Zhang, Xiaohui Zhang, Yan Zhao, Yong Ren, Le Xu, Junzuo Zhou, Hao Gu, Zhengqi Wen, Shan Liang, Zheng Lian, Shuai Nie, Haizhou Li

Abstract: Audio deepfake detection is an emerging topic in the artificial intelligence community. The second Audio Deepfake Detection Challenge (ADD 2023) aims to spur researchers around the world to build new innovative technologies that can further accelerate and foster research on detecting and analyzing deepfake speech utterances. Different from previous challenges (e.g. ADD 2022), ADD 2023 focuses on surpassing the constraints of binary real/fake classification, and actually localizing the manipulated intervals in a partially fake speech as well as pinpointing the source responsible for generating any fake audio. Furthermore, ADD 2023 includes more rounds of evaluation for the fake audio game sub-challenge. The ADD 2023 challenge includes three subchallenges: audio fake game (FG), manipulation region location (RL) and deepfake algorithm recognition (AR). This paper describes the datasets, evaluation metrics, and protocols. Some findings are also reported in audio deepfake detection tasks.

Submitted 23 May, 2023; originally announced May 2023.

arXiv:2304.04106 [pdf, other]

MedGen3D: A Deep Generative Framework for Paired 3D Image and Mask Generation

Authors: Kun Han, Yifeng Xiong, Chenyu You, Pooya Khosravi, Shanlin Sun, Xiangyi Yan, James Duncan, Xiaohui Xie

Abstract: Acquiring and annotating sufficient labeled data is crucial in developing accurate and robust learning-based models, but obtaining such data can be challenging in many medical image segmentation tasks. One promising solution is to synthesize realistic data with ground-truth mask annotations. However, no prior studies have explored generating complete 3D volumetric images with masks. In this paper, we present MedGen3D, a deep generative framework that can generate paired 3D medical images and masks. First, we represent the 3D medical data as 2D sequences and propose the Multi-Condition Diffusion Probabilistic Model (MC-DPM) to generate multi-label mask sequences adhering to anatomical geometry. Then, we use an image sequence generator and semantic diffusion refiner conditioned on the generated mask sequences to produce realistic 3D medical images that align with the generated masks. Our proposed framework guarantees accurate alignment between synthetic images and segmentation maps. Experiments on 3D thoracic CT and brain MRI datasets show that our synthetic data is both diverse and faithful to the original data, and demonstrate the benefits for downstream segmentation tasks. We anticipate that MedGen3D's ability to synthesize paired 3D medical images and masks will prove valuable in training deep learning models for medical imaging tasks.

Submitted 4 July, 2023; v1 submitted 8 April, 2023; originally announced April 2023.

Comments: Accepted by MICCAI 2023. Project Page: https://krishan999.github.io/MedGen3D/

arXiv:2303.06811 [pdf, other]

The NPU-Elevoc Personalized Speech Enhancement System for ICASSP2023 DNS Challenge

Authors: Xiaopeng Yan, Yindi Yang, Zhihao Guo, Liangliang Peng, Lei Xie

Abstract: This paper describes our NPU-Elevoc personalized speech enhancement system (NAPSE) for the 5th Deep Noise Suppression Challenge at ICASSP 2023. Based on the superior two-stage model TEA-PSE 2.0, our system particularly explores better strategy for speaker embedding fusion, optimizes the model training pipeline, and leverages adversarial training and multi-scale loss. According to the results, our system is tied for the 1st place in the headset track (track 1) and ranked 2nd in the speakerphone track (track 2).

Submitted 15 March, 2023; v1 submitted 12 March, 2023; originally announced March 2023.

arXiv:2301.01887 [pdf, other]

A Novel Exploitative and Explorative GWO-SVM Algorithm for Smart Emotion Recognition

Authors: Xucun Yan, Zihuai Lin, Zhiyun Lin, Branka Vucetic

Abstract: Emotion recognition or detection is broadly utilized in patient-doctor interactions for diseases such as schizophrenia and autism and the most typical techniques are speech detection and facial recognition. However, features extracted from these behavior-based emotion recognitions are not reliable since humans can disguise their emotions. Recording voices or tracking facial expressions for a long term is also not efficient. Therefore, our aim is to find a reliable and efficient emotion recognition scheme, which can be used for non-behavior-based emotion recognition in real-time. This can be solved by implementing a single-channel electrocardiogram (ECG) based emotion recognition scheme in a lightweight embedded system. However, existing schemes have relatively low accuracy. Therefore, we propose a reliable and efficient emotion recognition scheme - exploitative and explorative grey wolf optimizer based SVM (X - GWO - SVM) for ECG-based emotion recognition. Two datasets, one raw self-collected iRealcare dataset, and the widely-used benchmark WESAD dataset are used in the X - GWO - SVM algorithm for emotion recognition. This work demonstrates that the X - GWO - SVM algorithm can be used for emotion recognition and the algorithm exhibits superior performance in reliability compared to the use of other supervised machine learning methods in earlier works. It can be implemented in a lightweight embedded system, which is much more efficient than existing solutions based on deep neural networks.

Submitted 4 January, 2023; originally announced January 2023.

arXiv:2212.12661 [pdf, other]

Transmission Congestion Management with Generalized Generation Shift Distribution Factors

Authors: Shutong Pu, Guangchun Ruan, Xinfei Yan, Haiwang Zhong

Abstract: A major concern in modern power systems is that the popularity and fluctuating characteristics of renewable energy may cause more and more transmission congestion events. Traditional congestion management modeling involves AC or DC power flow equations, while the former equation always accompanies great amount of computation, and the latter cannot consider voltage amplitude and reactive power. Therefore, this paper proposes a congestion management approach incorporating a specially-designed generalized generator shift distribution factor (GSDF) to derive computationally-efficient and accurate management strategies. This congestion management strategy involves multiple balancing generators for generation shift operation. The proposed model is superior in a low computational complexity (linear equation) and versatile modeling representation with full consideration of voltage amplitude and reactive power.

Submitted 24 December, 2022; originally announced December 2022.

Comments: 5 pages, 4 figures. Accepted by conference: ICPES 2022

arXiv:2210.06973 [pdf, other]

Contrastive Psudo-supervised Classification for Intra-Pulse Modulation of Radar Emitter Signals Using data augmentation

Authors: HanCong Feng, XinHai Yan, KaiLi Jiang, XinYu Zhao, Bin Tang

Abstract: The automatic classification of radar waveform is a fundamental technique in electronic countermeasures (ECM). Recent supervised deep learning-based methods have achieved great success in such a classification task. However, those methods require enough labeled samples to work properly and in many circumstances, it is not available. To tackle this problem, in this paper, we propose a three-stages deep radar waveform clustering (DRSC) technique to automatically group the received signal samples without labels. Firstly, a pretext model is trained in a self-supervised way with the help of several data augmentation techniques to extract the class-dependent features. Next, the pseudo-supervised contrastive training is involved to further promote the separation between the extracted class-dependent features. And finally, the unsupervised problem is converted to a semi-supervised classification problem via pseudo label generation. The simulation results show that the proposed algorithm can effectively extract class-dependent features, outperforming several unsupervised clustering methods, even reaching performance on par with the supervised deep learning-based methods.

Submitted 13 October, 2022; originally announced October 2022.

arXiv:2209.13915 [pdf, ps, other]

Joint Optimization of Resource Allocation and Trajectory Control for Mobile Group Users in Fixed-Wing UAV-Enabled Wireless Network

Authors: Xuezhen Yan, Xuming Fang, Cailian Deng, Xianbin Wang

Abstract: Owing to the controlling flexibility and cost-effectiveness, fixed-wing unmanned aerial vehicles (UAVs) are expected to serve as flying base stations (BSs) in the air-ground integrated network. By exploiting the mobility of UAVs, controllable coverage can be provided for mobile group users (MGUs) under challenging scenarios or even somewhere without communication infrastructure. However, in such dual mobility scenario where the UAV and MGUs are all moving, both the non-hovering feature of the fixed-wing UAV and the movement of MGUs will exacerbate the dynamic changes of user scheduling, which eventually leads to the degradation of MGUs' quality-of-service (QoS). In this paper, we propose a fixed-wing UAV-enabled wireless network architecture to provide moving coverage for MGUs. In order to achieve fairness among MGUs, we maximize the minimum average throughput between all users by jointly optimizing the user scheduling, resource allocation, and UAV trajectory control under the constraints on users' QoS requirements, communication resources, and UAV trajectory switching. Considering the optimization problem is mixed-integer non-convex, we decompose it into three optimization subproblems. An efficient algorithm is proposed to solve these three subproblems alternately till the convergence is realized. Simulation results demonstrate that the proposed algorithm can significantly improve the minimum average throughput of MGUs.

Submitted 28 September, 2022; originally announced September 2022.

Comments: 30 pages, 9 figures

arXiv:2208.10489 [pdf, other]

System Fingerprint Recognition for Deepfake Audio: An Initial Dataset and Investigation

Authors: Xinrui Yan, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Junzuo Zhou, Hao Gu, Ruibo Fu

Abstract: The rapid progress of deep speech synthesis models has posed significant threats to society such as malicious content manipulation. Therefore, many studies have emerged to detect the so-called deepfake audio. However, existing works focus on the binary detection of real audio and fake audio. In real-world scenarios such as model copyright protection and digital evidence forensics, it is needed to know what tool or model generated the deepfake audio to explain the decision. This motivates us to ask: Can we recognize the system fingerprints of deepfake audio? In this paper, we present the first deepfake audio dataset for system fingerprint recognition (SFR) and conduct an initial investigation. We collected the dataset from the speech synthesis systems of seven Chinese vendors that use the latest state-of-the-art deep learning technologies, including both clean and compressed sets. In addition, to facilitate the further development of system fingerprint recognition methods, we provide extensive benchmarks that can be compared and research findings. The dataset will be publicly available.

Submitted 15 September, 2023; v1 submitted 21 August, 2022; originally announced August 2022.

Comments: 13 pages, 4 figures. Submit to IEEE Transactions on Audio, Speech and Language Processing (TASLP). arXiv admin note: text overlap with arXiv:2208.09646

arXiv:2208.09646 [pdf, other]

doi 10.1145/3552466.3556525

An Initial Investigation for Detecting Vocoder Fingerprints of Fake Audio

Authors: Xinrui Yan, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Haoxin Ma, Tao Wang, Shiming Wang, Ruibo Fu

Abstract: Many effective attempts have been made for fake audio detection. However, they can only provide detection results but no countermeasures to curb this harm. For many related practical applications, what model or algorithm generated the fake audio also is needed. Therefore, we propose a new problem for detecting vocoder fingerprints of fake audio. Experiments are conducted on the datasets synthesized by eight state-of-the-art vocoders. We have preliminarily explored the features and model architectures. The t-SNE visualization shows that different vocoders generate distinct vocoder fingerprints.

Submitted 20 August, 2022; originally announced August 2022.

Comments: Accepted by ACM Multimedia 2022 Workshop: First International Workshop on Deepfake Detection for Audio Multimedia

arXiv:2207.12308 [pdf, other]

CFAD: A Chinese Dataset for Fake Audio Detection

Authors: Haoxin Ma, Jiangyan Yi, Chenglong Wang, Xinrui Yan, Jianhua Tao, Tao Wang, Shiming Wang, Ruibo Fu

Abstract: Fake audio detection is a growing concern and some relevant datasets have been designed for research. However, there is no standard public Chinese dataset under complex conditions. In this paper, we aim to fill in the gap and design a Chinese fake audio detection dataset (CFAD) for studying more generalized detection methods. Twelve mainstream speech-generation techniques are used to generate fake audio. To simulate the real-life scenarios, three noise datasets are selected for noise adding at five different signal-to-noise ratios, and six codecs are considered for audio transcoding (format conversion). CFAD dataset can be used not only for fake audio detection but also for detecting the algorithms of fake utterances for audio forensics. Baseline results are presented with analysis. The results show that fake audio detection methods with generalization remain challenging. The CFAD dataset is publicly available at: https://zenodo.org/record/8122764.

Submitted 18 July, 2023; v1 submitted 12 July, 2022; originally announced July 2022.

Comments: FAD renamed as CFAD

arXiv:2203.12787 [pdf, other]

Design of an Internet of Things System for Smart Hospitals

Authors: Jichao Leng, Xucun Yan, Zihuai Lin

Abstract: With the fast advancement of smart devices and Internet of Things (IoT) technologies, certain established situations are opening up new avenues of exploration. Particularly in the sphere of healthcare, the large and diverse population, the complicated and professional data, and the stringent environmental requirements for certain medical scenes and equipment all impose exceptionally high standards on hospital administration. As a result, an effective and secure Internet of Things system is critical. This article proposes an IoT system that might be used in hospitals for a variety of purposes. This system collects data by LoRa, Wi-Fi, and other ways, uploads it to a cloud platform for processing over a secure connection, and then feeds it back to users in real time via the user interface. This system enables precise indoor localization through the use of UWB, ECG signal detection, environmental monitoring, and data on people flow.

Submitted 6 April, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

arXiv:2202.08433 [pdf, ps, other]

ADD 2022: the First Audio Deep Synthesis Detection Challenge

Authors: Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Ye Bai, Cunhang Fan, Shan Liang, Shiming Wang, Shuai Zhang, Xinrui Yan, Le Xu, Zhengqi Wen, Haizhou Li, Zheng Lian, Bin Liu

Abstract: Audio deepfake detection is an emerging topic, which was included in the ASVspoof 2021. However, the recent shared tasks have not covered many real-life and challenging scenarios. The first Audio Deep synthesis Detection challenge (ADD) was motivated to fill in the gap. The ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF) and audio fake game (FG). The LF track focuses on dealing with bona fide and fully fake utterances with various real-world noises etc. The PF track aims to distinguish the partially fake audio from the real. The FG track is a rivalry game, which includes two tasks: an audio generation task and an audio fake detection task. In this paper, we describe the datasets, evaluation metrics, and protocols. We also report major findings that reflect the recent advances in audio deepfake detection tasks.

Submitted 26 February, 2022; v1 submitted 16 February, 2022; originally announced February 2022.

Comments: Accepted by ICASSP 2022

arXiv:2202.06284 [pdf, ps, other]

Significant Low-dimensional Spectral-temporal Features for Seizure Detection

Authors: Xucun Yan, Dong** Yang, Zihuai Lin, Branka Vucetic

Abstract: Seizure onset detection in electroencephalography (EEG) signals is a challenging task due to the non-stereotyped seizure activities as well as their stochastic and non-stationary characteristics in nature. Joint spectral-temporal features are believed to contain sufficient and powerful feature information for absence seizure detection. However, the resulting high-dimensional features involve redundant information and require heavy computational load. Here, we discover significant low-dimensional spectral-temporal features in terms of mean-standard deviation of wavelet transform coefficient (MS-WTC), based on which a novel absence seizure detection framework is developed. The EEG signals are transformed into the spectral-temporal domain, with their low-dimensional features fed into a convolutional neural network. Superior detection performance is achieved on the widely-used benchmark dataset as well as a clinical dataset from the Chinese 301 Hospital. For the former, seven classification tasks were evaluated with accuracies from 99.8% to 100.0%, while for the latter, the method achieved a mean accuracy of 94.7%, outperforming other methods with low-dimensional temporal and spectral features. Experimental results on two seizure datasets demonstrate reliability, efficiency and stability of our proposed MS-WTC method, validating the significance of the extracted low-dimensional spectral-temporal features.

Submitted 13 February, 2022; originally announced February 2022.

arXiv:2201.10083 [pdf, other]

A Wearable ECG Monitor for Deep Learning Based Real-Time Cardiovascular Disease Detection

Authors: Peng Wang, Zihuai Lin, Xucun Yan, Zijiao Chen, Ming Ding, Yang Song, Lu Meng

Abstract: Cardiovascular disease has become one of the most significant threats endangering human life and health. Recently, Electrocardiogram (ECG) monitoring has been transformed into remote cardiac monitoring by Holter surveillance. However, the widely used Holter can bring a great deal of discomfort and inconvenience to the individuals who carry them. We developed a new wireless ECG patch in this work and applied a deep learning framework based on the Convolutional Neural Network (CNN) and Long Short-term Memory (LSTM) models. However, we find that the models using the existing techniques are not able to differentiate two main heartbeat types (Supraventricular premature beat and Atrial fibrillation) in our newly obtained dataset, resulting in low accuracy of 58.0%. We proposed a semi-supervised method to process the badly labelled data samples using confidence-level-based training. The experimental results show that the proposed method can approach an average accuracy of 90.2%, i.e., 5.4% higher than the accuracy of conventional ECG classification methods.

Submitted 24 January, 2022; originally announced January 2022.

arXiv:2110.10403 [pdf, other]

AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation

Authors: Xiangyi Yan, Hao Tang, Shanlin Sun, Haoyu Ma, Deying Kong, Xiaohui Xie

Abstract: Recent advances in transformer-based models have drawn attention to exploring these techniques in medical image segmentation, especially in conjunction with the U-Net model (or its variants), which has shown great success in medical image segmentation, under both 2D and 3D settings. Current 2D based methods either directly replace convolutional layers with pure transformers or consider a transformer as an additional intermediate encoder between the encoder and decoder of U-Net. However, these approaches only consider the attention encoding within one single slice and do not utilize the axial-axis information naturally provided by a 3D volume. In the 3D setting, convolution on volumetric data and transformers both consume large GPU memory. One has to either downsample the image or use cropped local patches to reduce GPU memory usage, which limits its performance. In this paper, we propose Axial Fusion Transformer UNet (AFTer-UNet), which takes advantage of both convolutional layers' capability of extracting detailed features and transformers' strength on long sequence modeling. It considers both intra-slice and inter-slice long-range cues to guide the segmentation. Meanwhile, it has fewer parameters and takes less GPU memory to train than the previous transformer-based models. Extensive experiments on three multi-organ segmentation datasets demonstrate that our method outperforms current state-of-the-art methods.

Submitted 20 October, 2021; originally announced October 2021.

arXiv:2108.01522 [pdf, other]

CSMCNet: Scalable Video Compressive Sensing Reconstruction with Interpretable Motion Estimation

Authors: Bowen Huang, Xiao Yan, **jia Zhou, Yibo Fan

Abstract: Most deep network methods for compressive sensing reconstruction suffer from the black-box characteristic of DNN. In this paper, a deep neural network with interpretable motion estimation named CSMCNet is proposed. The network is able to realize high-quality reconstruction of video compressive sensing by unfolding the iterative steps of optimization based algorithms. A DNN based, multi-hypothesis motion estimation module is designed to improve the reconstruction quality, and a residual module is employed to further narrow down the gap between reconstruction results and original signal in our proposed method. Besides, we propose an interpolation module with corresponding training strategy to realize scalable CS reconstruction, which is capable of using the same model to decode various compression ratios. Experiments show that a PSNR of 29.34dB can be achieved at 2% CS ratio (compressed by 98%), which is superior to other state-of-the-art methods. Moreover, the interpolation module is proved to be effective, with significant cost saving and acceptable performance losses.

Submitted 3 August, 2021; originally announced August 2021.

Comments: 12 pages, 10 pages, 5 tables

arXiv:2103.01661 [pdf, other]

Incorporating VAD into ASR System by Multi-task Learning

Authors: Meng Li, Xia Yan, Feng Lin

Abstract: When we use an end-to-end automatic speech recognition (E2E-ASR) system for real-world applications, a voice activity detection (VAD) system is usually needed to improve the performance and to reduce the computational cost by discarding non-speech parts in the audio. Usually ASR and VAD systems are trained and utilized independently of each other. In this paper, we present a novel multi-task learning (MTL) framework that incorporates VAD into the ASR system. The proposed system learns ASR and VAD jointly in the training stage. With the assistance of VAD, the ASR performance improves as its connectionist temporal classification (CTC) loss function can leverage the VAD alignment information. In the inference stage, the proposed system removes non-speech parts at low computational cost and recognizes speech parts with high robustness. Experimental results on segmented speech data show that by utilizing VAD information, the proposed method outperforms the baseline ASR system on both English and Chinese datasets. On unsegmented speech data, we find that the system outperforms the ASR systems that build an extra GMM-based or DNN-based voice activity detector.

Submitted 30 September, 2022; v1 submitted 2 March, 2021; originally announced March 2021.

Comments: 5 pages, 2 figures

arXiv:2101.02828 [pdf, other]

Distributionally Consistent Simulation of Naturalistic Driving Environment for Autonomous Vehicle Testing

Authors: Xintao Yan, Shuo Feng, Haowei Sun, Henry X. Liu

Abstract: Microscopic traffic simulation provides a controllable, repeatable, and efficient testing environment for autonomous vehicles (AVs). To evaluate AVs' safety performance unbiasedly, the probability distributions of environment statistics in the simulated naturalistic driving environment (NDE) need to be consistent with those from the real-world driving environment. However, although human driving behaviors have been extensively investigated in the transportation engineering field, most existing models were developed for traffic flow analysis without considering the distributional consistency of driving behaviors, which could cause significant evaluation bias for AV testing. To fill this research gap, a distributionally consistent NDE modeling framework is proposed in this paper. Using large-scale naturalistic driving data, empirical distributions are obtained to construct the stochastic human driving behavior models under different conditions. To address the error accumulation problem during the simulation, an optimization-based method is further designed to refine the empirical behavior models. Specifically, the vehicle state evolution is modeled as a Markov chain and its stationary distribution is twisted to match the distribution from the real-world driving environment. The framework is evaluated in the case study of a multi-lane highway driving simulation, where the distributional accuracy of the generated NDE is validated and the safety performance of an AV model is effectively evaluated.

Submitted 1 July, 2022; v1 submitted 7 January, 2021; originally announced January 2021.

Comments: 13 pages, 11 figures

arXiv:2010.03780 [pdf, other]

CS-MCNet: A Video Compressive Sensing Reconstruction Network with Interpretable Motion Compensation

Authors: Bowen Huang, **jia Zhou, Xiao Yan, Ming'e **g, Rentao Wan, Yibo Fan

Abstract: In this paper, a deep neural network with interpretable motion compensation called CS-MCNet is proposed to realize high-quality and real-time decoding of video compressive sensing. Firstly, explicit multi-hypothesis motion compensation is applied in our network to extract correlation information of adjacent frames (as shown in Fig. 1), which improves the recovery performance. Then, a residual module further narrows down the gap between reconstruction result and original signal. The overall architecture is interpretable by using algorithm unrolling, which brings the benefits of being able to transfer prior knowledge about the conventional algorithms. As a result, a PSNR of 22dB can be achieved at 64x compression ratio, which is about 4% to 9% better than state-of-the-art methods. In addition, due to the feed-forward architecture, the reconstruction can be processed by our network in real time and up to three orders of magnitude faster than traditional iterative methods.

Submitted 8 October, 2020; originally announced October 2020.

Comments: 15 pages, ACCV 2020 accepted paper

arXiv:2007.06151 [pdf, other]

MS-NAS: Multi-Scale Neural Architecture Search for Medical Image Segmentation

Authors: Xingang Yan, Weiwen Jiang, Yiyu Shi, Cheng Zhuo

Abstract: The recent breakthroughs of Neural Architecture Search (NAS) have motivated various applications in medical image segmentation. However, most existing works either simply rely on hyper-parameter tuning or stick to a fixed network backbone, thereby limiting the underlying search space to identify more efficient architecture. This paper presents a Multi-Scale NAS (MS-NAS) framework that features a multi-scale search space from network backbone to cell operation, and multi-scale fusion capability to fuse features with different sizes. To mitigate the computational overhead due to the larger search space, a partial channel connection scheme and a two-step decoding method are utilized to reduce computational overhead while maintaining optimization quality. Experimental results show that on various datasets for segmentation, MS-NAS outperforms the state-of-the-art methods and achieves 0.6-5.4% mIOU and 0.4-3.5% DSC improvements, while the computational resource consumption is reduced by 18.0-24.9%.

Submitted 12 July, 2020; originally announced July 2020.

arXiv:2006.08497 [pdf, other]

doi 10.1016/j.specom.2020.06.005

An Iterative Graph Spectral Subtraction Method for Speech Enhancement

Authors: Xue Yan, Zhen Yang, Tingting Wang, Haiyan Guo

Abstract: In this paper, we investigate the application of graph signal processing (GSP) theory in speech enhancement. We first propose a set of shift operators to construct graph speech signals, and then analyze their spectrum in the graph Fourier domain. By leveraging the differences between the spectrum of graph speech and graph noise signals, we further propose the graph spectral subtraction (GSS) method to suppress the noise interference in noisy speech. Moreover, based on GSS, we propose the iterative graph spectral subtraction (IGSS) method to further improve the speech enhancement performance. Our experimental results show that the proposed operators are suitable for graph speech signals, and the proposed methods outperform the traditional basic spectral subtraction (BSS) method and iterative basic spectral subtraction (IBSS) method in terms of both signal-to-noise ratios (SNR) and mean Perceptual Evaluation of Speech Quality (PESQ).

Submitted 15 June, 2020; originally announced June 2020.

Journal ref: SPECOM_SPECOM_2020_15

arXiv:2005.11626 [pdf, other]

ShapeAdv: Generating Shape-Aware Adversarial 3D Point Clouds

Authors: Kibok Lee, Zhuoyuan Chen, Xinchen Yan, Raquel Urtasun, Ersin Yumer

Abstract: We introduce ShapeAdv, a novel framework to study shape-aware adversarial perturbations that reflect the underlying shape variations (e.g., geometric deformations and structural differences) in the 3D point cloud space. We develop shape-aware adversarial 3D point cloud attacks by leveraging the learned latent space of a point cloud auto-encoder where the adversarial noise is applied in the latent space. Specifically, we propose three different variants including an exemplar-based one by guiding the shape deformation with auxiliary data, such that the generated point cloud resembles the shape morphing between objects in the same category. Different from prior works, the resulting adversarial 3D point clouds reflect the shape variations in the 3D point cloud space while still being close to the original one. In addition, experimental evaluations on the ModelNet40 benchmark demonstrate that our adversaries are more difficult to defend with existing point cloud defense methods and exhibit a higher attack transferability across classifiers. Our shape-aware adversarial attacks are orthogonal to existing point cloud based attacks and shed light on the vulnerability of 3D deep neural networks.

Submitted 23 May, 2020; originally announced May 2020.

Comments: 3D Point Clouds, Adversarial Learning

arXiv:2003.08525

Extremal Region Analysis based Deep Learning Framework for Detecting Defects

Authors: Zelin Deng, Xiaolong Yan, Shengjun Zhang, Colleen P. Bailey

Abstract: A maximally stable extreme region (MSER) analysis based convolutional neural network (CNN) for unified defect detection framework is proposed in this paper. Our proposed framework utilizes the generality and stability of MSER to generate the desired defect candidates. Then a specific trained binary CNN classifier is adopted over the defect candidates to produce the final defect set. Defect datasets over different categories are used in the experiments. More generally, the parameter settings in MSER can be adjusted to satisfy different requirements in various industries (high precision, high recall, etc). Extensive experimental results have shown the efficacy of the proposed framework.

Submitted 22 May, 2020; v1 submitted 18 March, 2020; originally announced March 2020.

Comments: Unsatisfied with results

arXiv:2002.04705 [pdf, other]

Smart Cameras

Authors: David J. Brady, Minghao Hu, Chengyu Wang, Xuefei Yan, Lu Fang, Yiwnheng Zhu, Yang Tan, Ming Cheng, Zhan Ma

Abstract: We review camera architecture in the age of artificial intelligence. Modern cameras use physical components and software to capture, compress and display image data. Over the past 5 years, deep learning solutions have become superior to traditional algorithms for each of these functions. Deep learning enables 10-100x reduction in electrical sensor power per pixel, 10x improvement in depth of field and dynamic range and 10-100x improvement in image pixel count. Deep learning enables multiframe and multiaperture solutions that fundamentally shift the goals of physical camera design. Here we review the state of the art of deep learning in camera operations and consider the impact of AI on the physical design of cameras.

Submitted 11 February, 2020; originally announced February 2020.

arXiv:2002.00529 [pdf, ps, other]

Genetic Algorithm Optimized Support Vector Machine in NOMA-Based Satellite Networks with Imperfect CSI

Authors: Xiaojuan Yan, Kang An, Cheng-Xiang Wang, Wei-** Zhu, Yusheng Li, Zhiqiang Feng

Abstract: With the help of a power-domain non-orthogonal multiple access (NOMA) scheme, satellite networks can simultaneously serve multiple users within a limited time/spectrum resource block. However, the existence of channel estimation errors inevitably degrades the judgment on users' channel state information (CSI) accuracy, thus affecting the user pairing processing and suppressing the superiority of the NOMA scheme. Inspired by the advantages of machine learning (ML) algorithms, we propose an improved support vector machine (SVM) scheme to reduce the inappropriate user pairing risks and enhance the performance of NOMA based satellite networks with imperfect CSI. Particularly, a genetic algorithm (GA) is employed to optimize the regularization and kernel parameters of the SVM, which effectively improves the classification accuracy of the proposed scheme. Simulations are provided to demonstrate that the performance of the proposed method is better than that with random user pairing strategy, especially in the scenario with a large number of users.

Submitted 2 February, 2020; originally announced February 2020.

arXiv:2001.08869 [pdf, other]

Nonparametric Structure Regularization Machine for 2D Hand Pose Estimation

Authors: Yifei Chen, Haoyu Ma, Deying Kong, Xiangyi Yan, Jianbao Wu, Wei Fan, Xiaohui Xie

Abstract: Hand pose estimation is more challenging than body pose estimation due to severe articulation, self-occlusion and high dexterity of the hand. Current approaches often rely on a popular body pose algorithm, such as the Convolutional Pose Machine (CPM), to learn 2D keypoint features. These algorithms cannot adequately address the unique challenges of hand pose estimation, because they are trained solely based on keypoint positions without seeking to explicitly model structural relationship between them. We propose a novel Nonparametric Structure Regularization Machine (NSRM) for 2D hand pose estimation, adopting a cascade multi-task architecture to learn hand structure and keypoint representations jointly. The structure learning is guided by synthetic hand mask representations, which are directly computed from keypoint positions, and is further strengthened by a novel probabilistic representation of hand limbs and an anatomically inspired composition strategy of mask synthesis. We conduct extensive studies on two public datasets - OneHand 10k and CMU Panoptic Hand. Experimental results demonstrate that explicitly enforcing structure learning consistently improves pose estimation accuracy of CPM baseline models, by 1.17% on the first dataset and 4.01% on the second one. The implementation and experiment code is freely available online. Our proposal of incorporating structural learning to hand pose estimation requires no additional training information, and can be a generic add-on module to other pose estimation models.

Submitted 23 January, 2020; originally announced January 2020.

Comments: The paper has been accepted and will be presented at 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). The code is freely available at https://github.com/HowieMa/NSRMhand

arXiv:1911.08030 [pdf, other]

Driver Identification Based on Vehicle Telematics Data using LSTM-Recurrent Neural Network

Authors: Abenezer Girma, Xuyang Yan, Abdollah Homaifar

Abstract: Despite advancements in vehicle security systems, over the last decade, auto-theft rates have increased, and cyber-security attacks on internet-connected and autonomous vehicles are becoming a new threat. In this paper, a deep learning model is proposed, which can identify drivers from their driving behaviors based on vehicle telematics data. The proposed Long-Short-Term-Memory (LSTM) model predicts the identity of the driver based on the individual's unique driving patterns learned from the vehicle telematics data. Given that the telematics is time-series data, the problem is formulated as a time series prediction task to exploit the embedded sequential information. The performance of the proposed approach is evaluated on three naturalistic driving datasets, which gives high accuracy prediction results. The robustness of the model on noisy and anomalous data that is usually caused by sensor defects or environmental factors is also investigated. Results show that the proposed model prediction accuracy remains satisfactory and outperforms the other approaches despite the extent of anomalies and noise induced in the data.

Submitted 18 November, 2019; originally announced November 2019.

Comments: IEEE ICTAI 2019

arXiv:1908.10903 [pdf, other]

Compressive Sampling for Array Cameras

Authors: Xuefei Yan, David J. Brady, Jianqiang Wang, Chao Huang, Zian Li, Songsong Yan, Di Liu, Zhan Ma

Abstract: While design of high performance lenses and image sensors has long been the focus of camera development, the size, weight and power of image data processing components is currently the primary barrier to radical improvements in camera resolution. Here we show that Deep-Learning-Aided Compressive Sampling (DLACS) can reduce operating power on camera-head electronics by 20x. Traditional compressive sampling has to date been primarily applied in the physical sensor layer; we show here that with aid from deep learning algorithms, compressive sampling offers unique power management advantages in digital layer compression.

Submitted 28 August, 2019; originally announced August 2019.

Showing 1–50 of 53 results for author: Yan, X