Search | arXiv e-print repository

Revisiting Multi-User Downlink in IEEE 802.11ax: A Designers Guide to MU-MIMO

Authors: Liu Cao, Lyutianyang Zhang, Sumit Roy, Sian Jin

Abstract: Downlink (DL) Multi-User (MU) Multiple Input Multiple Output (MU-MIMO) is a key technology that allows multiple concurrent data transmissions from an Access Point (AP) to a selected sub-set of clients for higher network efficiency in IEEE 802.11ax. However, DL MU-MIMO feature is typically turned off as the default setting in AP vendors' products, that is, turning on the DL MU-MIMO may not help increase the network efficiency, which is counter-intuitive. In this article, we provide a sufficiently deep understanding of the interplay between the various underlying factors, i.e., CSI overhead and spatial correlation, which result in negative results when turning on the DL MU-MIMO. Furthermore, we provide a fundamental guideline as a function of operational scenarios to address the fundamental question "when the DL MU-MIMO should be turned on/off".

Submitted 9 June, 2024; originally announced June 2024.

Comments: This work has been submitted to the IEEE for possible publication. 7 pages, 6 figures, magazine paper

arXiv:2406.00085 [pdf, other]

Augmentation-based Unsupervised Cross-Domain Functional MRI Adaptation for Major Depressive Disorder Identification

Authors: Yunling Ma, Chaojun Zhang, Xiaochuan Wang, Qianqian Wang, Liang Cao, Limei Zhang, Mingxia Liu

Abstract: Major depressive disorder (MDD) is a common mental disorder that typically affects a person's mood, cognition, behavior, and physical health. Resting-state functional magnetic resonance imaging (rs-fMRI) data are widely used for computer-aided diagnosis of MDD. While multi-site fMRI data can provide more data for training reliable diagnostic models, significant cross-site data heterogeneity would result in poor model generalizability. Many domain adaptation methods are designed to reduce the distributional differences between sites to some extent, but usually ignore overfitting problem of the model on the source domain. Intuitively, target data augmentation can alleviate the overfitting problem by forcing the model to learn more generalized features and reduce the dependence on source domain data. In this work, we propose a new augmentation-based unsupervised cross-domain fMRI adaptation (AUFA) framework for automatic diagnosis of MDD. The AUFA consists of 1) a graph representation learning module for extracting rs-fMRI features with spatial attention, 2) a domain adaptation module for feature alignment between source and target data, 3) an augmentation-based self-optimization module for alleviating model overfitting on the source domain, and 4) a classification module. Experimental results on 1,089 subjects suggest that AUFA outperforms several state-of-the-art methods in MDD identification. Our approach not only reduces data heterogeneity between different sites, but also localizes disease-related functional connectivity abnormalities and provides interpretability for the model.

Submitted 6 June, 2024; v1 submitted 31 May, 2024; originally announced June 2024.

arXiv:2405.11115 [pdf]

Ptychographic non-line-of-sight imaging for depth-resolved visualization of hidden objects

Authors: Pengming Song, Qianhao Zhao, Ruihai Wang, Ninghe Liu, Yingqi Qiang, Tianbo Wang, Xincheng Zhang, Yi Zhang, Liangcai Cao, Guoan Zheng

Abstract: Non-line-of-sight (NLOS) imaging enables the visualization of objects hidden from direct view, with applications in surveillance, remote sensing, and light detection and ranging. Here, we introduce a NLOS imaging technique termed ptychographic NLOS (pNLOS), which leverages coded ptychography for depth-resolved imaging of obscured objects. Our approach involves scanning a laser spot on a wall to illuminate the hidden objects in an obscured region. The reflected wavefields from these objects then travel back to the wall, get modulated by the wall's complex-valued profile, and the resulting diffraction patterns are captured by a camera. By modulating the object wavefields, the wall surface serves the role of the coded layer as in coded ptychography. As we scan the laser spot to different positions, the reflected object wavefields on the wall translate accordingly, with the shifts varying for objects at different depths. This translational diversity enables the acquisition of a set of modulated diffraction patterns referred to as a ptychogram. By processing the ptychogram, we recover both the objects at different depths and the modulation profile of the wall surface. Experimental results demonstrate high-resolution, high-fidelity imaging of hidden objects, showcasing the potential of pNLOS for depth-aware vision beyond the direct line of sight.

Submitted 17 May, 2024; originally announced May 2024.

arXiv:2405.08745 [pdf, other]

Enhancing Blind Video Quality Assessment with Rich Quality-aware Features

Authors: Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhichao Zhang, Linhan Cao, Qiubo Chen, Xiongkuo Min, Weisi Lin, Guangtao Zhai

Abstract: In this paper, we present a simple but effective method to enhance blind video quality assessment (BVQA) models for social media videos. Motivated by previous researches that leverage pre-trained features extracted from various computer vision models as the feature representation for BVQA, we further explore rich quality-aware features from pre-trained blind image quality assessment (BIQA) and BVQA models as auxiliary features to help the BVQA model to handle complex distortions and diverse content of social media videos. Specifically, we use SimpleVQA, a BVQA model that consists of a trainable Swin Transformer-B and a fixed SlowFast, as our base model. The Swin Transformer-B and SlowFast components are responsible for extracting spatial and motion features, respectively. Then, we extract three kinds of features from Q-Align, LIQE, and FAST-VQA to capture frame-level quality-aware features, frame-level quality-aware along with scene-specific features, and spatiotemporal quality-aware features, respectively. Through concatenating these features, we employ a multi-layer perceptron (MLP) network to regress them into quality scores. Experimental results demonstrate that the proposed model achieves the best performance on three public social media VQA datasets. Moreover, the proposed model won first place in the CVPR NTIRE 2024 Short-form UGC Video Quality Assessment Challenge. The code is available at https://github.com/sunwei925/RQ-VQA.git.

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2404.11313 [pdf, other]

NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Authors: Xin Li, Kun Yuan, Yajing Pei, Yiting Lu, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhichao Zhang, Linhan Cao, Qiubo Chen, Xiongkuo Min, Weisi Lin, Guangtao Zhai, Jianhui Sun, Tianyi Wang, Lei Li, Han Kong, Wenxuan Wang, Bing Li, Cheng Luo, et al. (43 additional authors not shown)

Abstract: This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performances for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.

Submitted 17 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR2024 Workshop. The challenge report for CVPR NTIRE2024 Short-form UGC Video Quality Assessment Challenge

arXiv:2404.11278 [pdf, other]

Study on the static detection of ICF target based on muonic X-ray sphere encoded imaging

Authors: Dikai Li, Jian Yu, Qian Chen, Chunhui Zhang, Xiangyu Wan, Leifeng Cao

Abstract: Muon Induced X-ray Emission (MIXE) was discovered by Chinese physicist Zhang Wenyu as early as 1947, and it can conduct non-destructive elemental analysis inside samples. Research has shown that MIXE can retain the high efficiency of direct imaging while benefiting from the low noise of pinhole imaging through encoding holes. The related technology significantly improves the counting rate while maintaining imaging quality. The sphere encoding technology effectively solves the imaging blurring caused by the tilting of the encoding system, and successfully images micrometer sized X-ray sources. This paper will combine MIXE and X-ray sphere coding imaging techniques, including ball coding and zone plates, to study the method of non-destructive deep structure imaging of ICF targets and obtaining sub element distribution. This method aims to develop a new method for ICF target detection, which is particularly important for inertial confinement fusion. At the same time, this method can be used to detect and analyze materials that are difficult to penetrate or sensitive, and is expected to solve the problem of element resolution and imaging that traditional technologies cannot overcome. It will provide new methods for the future development of multiple fields such as particle physics, material science, and X-ray optics.

Submitted 17 April, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.01164 [pdf, ps, other]

Unified Predefined-time Stability Conditions of Nonlinear Systems with Lyapunov Analysis

Authors: Bing Xiao, Haichao Zhang, Shijie Zhao, Lu Cao

Abstract: This brief gives a set of unified Lyapunov stability conditions to guarantee the predefined-time/finite-time stability of dynamical systems. The derived Lyapunov theorem for autonomous systems establishes equivalence with existing theorems on predefined-time/finite-time stability. The findings proposed herein develop a nonsingular sliding mode control framework for an Euler-Lagrange system to analyze its stability, and its upper bound for the settling time can be arbitrarily determined a priori through a predefined time constant.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2312.11460 [pdf, other]

Hybrid Internal Model: Learning Agile Legged Locomotion with Simulated Robot Response

Authors: Junfeng Long, Zirui Wang, Quanyi Li, Jiawei Gao, Liu Cao, Jiangmiao Pang

Abstract: Robust locomotion control depends on accurate state estimations. However, the sensors of most legged robots can only provide partial and noisy observations, making the estimation particularly challenging, especially for external states like terrain frictions and elevation maps. Inspired by the classical Internal Model Control principle, we consider these external states as disturbances and introduce Hybrid Internal Model (HIM) to estimate them according to the response of the robot. The response, which we refer to as the hybrid internal embedding, contains the robot's explicit velocity and implicit stability representation, corresponding to two primary goals for locomotion tasks: explicitly tracking velocity and implicitly maintaining stability. We use contrastive learning to optimize the embedding to be close to the robot's successor state, in which the response is naturally embedded. HIM has several appealing benefits: It only needs the robot's proprioceptions, i.e., those from joint encoders and IMU as observations. It innovatively maintains consistent observations between simulation reference and reality that avoids information loss in mimicking learning. It exploits batch-level information that is more robust to noises and keeps better sample efficiency. It only requires 1 hour of training on an RTX 4090 to enable a quadruped robot to traverse any terrain under any disturbances. A wealth of real-world experiments demonstrates its agility, even in high-difficulty tasks and cases never occurred during the training process, revealing remarkable open-world generalizability.

Submitted 1 January, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: Use 1 hour to train a quadruped robot capable of traversing any terrain under any disturbances in the open world, Project Page: https://github.com/OpenRobotLab/HIMLoco

arXiv:2311.03679 [pdf, other]

Unsupervised convolutional neural network fusion approach for change detection in remote sensing images

Authors: Weidong Yan, Pei Yan, Li Cao

Abstract: With the rapid development of deep learning, a variety of change detection methods based on deep learning have emerged in recent years. However, these methods usually require a large number of training samples to train the network model, so it is very expensive. In this paper, we introduce a completely unsupervised shallow convolutional neural network (USCNN) fusion approach for change detection. Firstly, the bi-temporal images are transformed into different feature spaces by using convolution kernels of different sizes to extract multi-scale information of the images. Secondly, the output features of bi-temporal images at the same convolution kernels are subtracted to obtain the corresponding difference images, and the difference feature images at the same scale are fused into one feature image by using 1 * 1 convolution layer. Finally, the output features of different scales are concatenated and a 1 * 1 convolution layer is used to fuse the multi-scale information of the image. The model parameters are obtained by a redesigned sparse function. Our model has three features: the entire training process is conducted in an unsupervised manner, the network architecture is shallow, and the objective function is sparse. Thus, it can be seen as a kind of lightweight network model. Experimental results on four real remote sensing datasets indicate the feasibility and effectiveness of the proposed approach.

Submitted 6 November, 2023; originally announced November 2023.

arXiv:2311.02447 [pdf, other]

Quantized-but-uncoded Distributed Detection (QDD) with Unreliable Reporting Channels

Authors: Lei Cao, Ramanarayanan Viswanathan

Abstract: Distributed detection primarily centers around two approaches: Unquantized Distributed Detection (UDD), where each sensor reports its complete observation to the fusion center (FC), and quantized-and-Coded DD (CDD), where each sensor first partitions the observation space and then reports to the FC a codeword. In this paper, we introduce Quantized-but-uncoded DD (QDD), where each sensor, after quantization, transmits a summarized value, instead of a codeword, to the FC. We show that QDD well adapts to the constraint of transmission power when compared to CDD, albeit with increased complexity in parameter selection. Moreover, we establish that, in the presence of independent observations, QDD upholds a necessary condition inherent in CDD. Specifically, the optimal sensor decision rules are the likelihood ratio quantizers (LRQ), irrelevant to the channel conditions. In the context of a single-sensor scenario involving binary decision at the sensor, we find that the optimal sensor rule in QDD is in general no longer "channel blind", a feature presented in CDD. In addition, we compare these systems numerically under the same transmission power and bandwidth, while assuming additive white Gaussian noise (AWGN) in both sensing and reporting stages. Finally, we present some potential directions for future research.

Submitted 4 November, 2023; originally announced November 2023.

Comments: 11 pages, 8 figures, submitted to IEEE T-IT

arXiv:2310.16137 [pdf, other]

Codebook-based Uplink Transmission Enhancement in 5G Advanced: Sub-band Precoding

Authors: Liu Cao, Yahia Shabara, Parisa Cheraghi

Abstract: The transformative enhancements of fifth-generation (5G) mobile devices bring about new challenges to achieve better uplink (UL) performance. Particularly, in codebook-based transmission, the wide-band (WB) precoding and the legacy UL codebook may become main bottlenecks for higher efficient data transmission. In this paper, we investigate the codebook-based UL single-layer transmission performance using fully coherent antenna ports in the context of sub-band (SB) precoding. We analyze the SB precoder selection criteria and design an UL codebook used for SB precoding by increasing the number of relative phase shifts of each port. Via link-level simulations, we verify that the UL SB precoding can improve up to 2 dB performance gain in terms of the block error rate (BLER) compared with the UL WB precoding which is the current UL precoding scheme. We also show that UL performance gain is sensitive to the SB size selection as well as the relative phase shift diversity.

Submitted 29 October, 2023; v1 submitted 24 October, 2023; originally announced October 2023.

Comments: This work has been accepted by IEEE VCC 2023. 5 pages, 7 figures

arXiv:2310.05368 [pdf, other]

Measuring Acoustics with Collaborative Multiple Agents

Authors: Yinfeng Yu, Changan Chen, Lele Cao, Fangkai Yang, Fuchun Sun

Abstract: As humans, we hear sound every second of our life. The sound we hear is often affected by the acoustics of the environment surrounding us. For example, a spacious hall leads to more reverberation. Room Impulse Responses (RIR) are commonly used to characterize environment acoustics as a function of the scene geometry, materials, and source/receiver locations. Traditionally, RIRs are measured by setting up a loudspeaker and microphone in the environment for all source/receiver locations, which is time-consuming and inefficient. We propose to let two robots measure the environment's acoustics by actively moving and emitting/receiving sweep signals. We also devise a collaborative multi-agent policy where these two robots are trained to explore the environment's acoustics while being rewarded for wide exploration and accurate prediction. We show that the robots learn to collaborate and move to explore environment acoustics while minimizing the prediction error. To the best of our knowledge, we present the very first problem formulation and solution to the task of collaborative environment acoustics measurements with multiple agents.

Submitted 8 October, 2023; originally announced October 2023.

Comments: Main paper (9 pages and 5 figures and 2 tables) and appendix (16 pages and 13 figures and 10 tables). Accepted for publication by IJCAI 2023

arXiv:2309.16680 [pdf, other]

Semi-Persistent Scheduling in NR Sidelink Mode 2: MAC Packet Reception Ratio Model and Validation

Authors: Liu Cao, Sumit Roy, Collin Brady

Abstract: 5G NR Sidelink (SL) has demonstrated the promising capability for infrastructure-less cellular coverage. Understanding the fundamentals of the NR SL channel access mechanism, Semi-Persistent Scheduling (SPS), which is specified by the 3rd Generation Partnership Project (3GPP), is a necessity to enhance the NR SL Packet Reception Ratio (PRR). However, most existing works fail to account for the new SPS features introduced in NR SL, which might be out-of-date for comprehensively describing the NR SL PRR. The existing models ignore the relationships between SPS parameters and therefore do not provide sufficient insights into the PRR of SPS. This work proposes a novel SPS PRR model incorporating MAC collisions based on new features in NR SL. We extend our model by loosening several simplifying assumptions made in our initial modeling. The extended models illustrate how the PRR is affected by various SPS parameters. The computed results are validated via simulations using the network simulator (ns-3), which provides important guidelines for future NR SL enhancement work.

Submitted 26 July, 2023; originally announced September 2023.

Comments: This work has been submitted to the IEEE for possible publication. 13 pages, 21 figures

arXiv:2309.09843 [pdf, other]

Instruction-Following Speech Recognition

Authors: Cheng-I Jeff Lai, Zhiyun Lu, Liangliang Cao, Ruoming Pang

Abstract: Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. This enables a multitude of speech recognition tasks -- ranging from transcript manipulation to summarization -- without relying on predefined command sets. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring LLMs or pre-trained speech modules. It also offers selective transcription options based on instructions like "transcribe first half and then turn off listening," providing an additional layer of privacy and safety compared to existing LLMs. Our findings highlight the significant potential of instruction-following training to advance speech foundation models.

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2308.03263 [pdf, other]

Prototyping and real-world field trials of RIS-aided wireless communications

Authors: Xilong Pei, Haifan Yin, Li Tan, Lin Cao, Taorui Yang

Abstract: Reconfigurable intelligent surface (RIS) is a promising technology that has the potential to change the way we interact with the wireless propagating environment. In this paper, we design and fabricate an RIS system that can be used in the fifth generation (5G) mobile communication networks. We also propose a practical two-step spatial-oversampling codebook algorithm for the beamforming of RIS, which is based on the spatial structure of the wireless channel. This algorithm has much lower complexity compared to the two-dimensional full-space searching-based codebook, yet with only negligible performance loss. Then, a series of experiments are conducted with the fabricated RIS systems, covering the office, corridor, and outdoor environments, in order to verify the effectiveness of RIS in both laboratory and current 5G commercial networks. In the office and corridor scenarios, the 5.8 GHz RIS provided a 10-20 dB power gain at the receiver. In the outdoor test, over 35 dB power gain was observed with RIS compared to the non-deployment case. However, in commercial 5G networks, the 2.6 GHz RIS improved indoor signal strength by only 4-7 dB. The experimental results indicate that RIS achieves higher power gain when transceivers are equipped with directional antennas instead of omni-directional antennas.

Submitted 6 August, 2023; originally announced August 2023.

Comments: 10 pages, 21 figures

arXiv:2307.02297 [pdf, other]

RIS with insufficient phase shifting capability: Modeling, beamforming, and experimental validations

Authors: Lin Cao, Haifan Yin, Li Tan, Xilong Pei

Abstract: Most research works on reconfigurable intelligent surfaces (RIS) rely on idealized models of the reflection coefficients, i.e., uniform reflection amplitude for any phase and sufficient phase shifting capability. In practice however, such models are oversimplified. This paper introduces a realistic reflection coefficient model for RIS based on measurements. The reflection coefficients are modeled as discrete complex values that have non-uniform amplitudes and suffer from insufficient phase shift capability. We then propose a group-based query algorithm that takes the imperfect coefficients into consideration while calculating the reflection coefficients. We analyze the performance of the proposed algorithm, and derive the closed-form expressions to characterize the received power of an RIS-aided wireless communication system. The performance gains of the proposed algorithm are confirmed in simulations. Furthermore, we validate the proposed theoretical results by experiments with our fabricated RIS prototype systems. The simulation and measurement results match well with the theoretical analysis.

Submitted 16 April, 2024; v1 submitted 5 July, 2023; originally announced July 2023.

Comments: 13 pages, 11 figures

arXiv:2303.12693 [pdf, other]

Resilient Output Containment Control of Heterogeneous Multiagent Systems Against Composite Attacks: A Digital Twin Approach

Authors: Yukang Cui, Lingbo Cao, Michael V. Basin, Jun Shen, Tingwen Huang, Xin Gong

Abstract: This paper studies the distributed resilient output containment control of heterogeneous multiagent systems against composite attacks, including denial-of-services (DoS) attacks, false-data injection (FDI) attacks, camouflage attacks, and actuation attacks. Inspired by digital twins, a twin layer (TL) with higher security and privacy is used to decouple the above problem into two tasks: defense protocols against DoS attacks on TL and defense protocols against actuation attacks on cyber-physical layer (CPL). First, considering modeling errors of leader dynamics, we introduce distributed observers to reconstruct the leader dynamics for each follower on TL under DoS attacks. Second, distributed estimators are used to estimate follower states according to the reconstructed leader dynamics on the TL. Third, according to the reconstructed leader dynamics, we design decentralized solvers that calculate the output regulator equations on CPL. Fourth, decentralized adaptive attack-resilient control schemes that resist unbounded actuation attacks are provided on CPL. Furthermore, we apply the above control protocols to prove that the followers can achieve uniformly ultimately bounded (UUB) convergence, and the upper bound of the UUB convergence is determined explicitly. Finally, two simulation examples are provided to show the effectiveness of the proposed control protocols.

Submitted 22 March, 2023; originally announced March 2023.

arXiv:2303.02938 [pdf, other]

RIS-aided Wireless Communications: Can RIS Beat Metal Plate?

Authors: Jiangfeng Hu, Haifan Yin, Li Tan, Lin Cao, Xilong Pei

Abstract: Reconfigurable Intelligent Surface (RIS) has recently been regarded as a paradigm-shifting technology beyond 5G, for its flexibility in smartly adjusting the response to impinging electromagnetic (EM) waves. Usually, RIS can be implemented by properly reconfiguring the adjustable parameters of each RIS unit to align the signal phase on the receiver side. It is believed that the phase alignment can also be achieved mechanically by a metal plate with the same physical size. However, we found in prototype experiments that a well-rotated metal plate can only approximately perform as well as RIS under limited conditions, although its scattering efficiency is relatively higher. In the case of an impinging spherical wave, RIS outperforms the metal plate even beyond the receiving near-field regions. We analyze this phenomenon with wave optics theory and propose explicit scattering models for both the metal plate and RIS in general scenarios. Finally, the models are validated by simulations and field measurements.

Submitted 6 March, 2023; originally announced March 2023.

Comments: 5 pages, 5 figures

arXiv:2301.11535 [pdf, other]

doi 10.1109/TKDE.2023.3323956

Learning Informative Representation for Fairness-aware Multivariate Time-series Forecasting: A Group-based Perspective

Authors: Hui He, Qi Zhang, Shoujin Wang, Kun Yi, Zhendong Niu, Longbing Cao

Abstract: Performance unfairness among variables widely exists in multivariate time series (MTS) forecasting models since such models may attend to, or be biased toward, certain (advantaged) variables. Addressing this unfairness problem is important for equally attending to all variables and avoiding vulnerable model biases/risks. However, fair MTS forecasting is challenging and has been less studied in the literature. To bridge this significant gap, we formulate the fairness modeling problem as learning informative representations attending to both advantaged and disadvantaged variables. Accordingly, we propose a novel framework, named FairFor, for fairness-aware MTS forecasting. FairFor is based on adversarial learning to generate both group-independent and group-relevant representations for the downstream forecasting. The framework first leverages a spectral relaxation of the K-means objective to infer variable correlations and thus to group variables. Then, it utilizes a filtering & fusion component to filter the group-relevant information and generate group-independent representations via orthogonality regularization. The group-independent and group-relevant representations form highly informative representations, facilitating the sharing of knowledge from advantaged variables to disadvantaged variables to guarantee fairness. Extensive experiments on four public datasets demonstrate the effectiveness of our proposed FairFor for fair forecasting and significant performance improvement.

Submitted 23 October, 2023; v1 submitted 26 January, 2023; originally announced January 2023.

Comments: 13 pages, 5 figures, accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE)

MSC Class: 68Txx ACM Class: I.2.6

arXiv:2301.02784 [pdf, other]

Active Fault Isolation for Discrete Event Systems

Authors: Lin Cao, Shaolong Shu, Feng Lin

Abstract: In practice, we can not only disable some events, but also enforce the occurrence of some events prior to the occurrence of other events by external control. In this paper, we combine these two control mechanisms to synthesize a more powerful supervisor. Here our control goal is to design an isolation supervisor which ensures that, in the closed-loop system, faults are isolatable in the sense that after a fault occurs, we can determine which type the fault belongs to by observing the output of the closed-loop system. The isolation supervisor starts to work when the occurrence of faults is detected. We then solve the isolation supervisor synthesis problem as follows. For a given discrete event system, we first construct a bipartite transition system which includes all feasible isolation supervisors. An isolation supervisor is feasible if it enforces only events that are physically possible. We then develop an algorithm to check whether the synthesis problem is solvable or not. The algorithm can also be used to find a valid isolation supervisor if the synthesis problem is solvable. The method of combining two control mechanisms can be used to synthesize more powerful supervisors for other supervisory control problems of discrete event systems as well.

Submitted 7 January, 2023; originally announced January 2023.

MSC Class: 93B99 ACM Class: G.2; H.4

arXiv:2301.00656 [pdf, other]

TriNet: stabilizing self-supervised learning from complete or slow collapse on ASR

Authors: Lixin Cao, Jun Wang, Ben Yang, Dan Su, Dong Yu

Abstract: Self-supervised learning (SSL) models confront challenges of abrupt informational collapse or slow dimensional collapse. We propose TriNet, which introduces a novel triple-branch architecture for preventing collapse and stabilizing the pre-training. TriNet learns the SSL latent embedding space and incorporates it into a higher-level space for predicting pseudo target vectors generated by a frozen teacher. Our experimental results show that the proposed method notably stabilizes and accelerates pre-training and achieves a relative word error rate reduction (WERR) of 6.06% compared to the state-of-the-art (SOTA) Data2vec for a downstream benchmark ASR task. We will release our code at https://github.com/tencent-ailab/.

Submitted 14 March, 2023; v1 submitted 12 December, 2022; originally announced January 2023.

Comments: Accepted by ICASSP 2023

arXiv:2210.13740 [pdf, other]

Latency-aware End-to-end Multi-path Data Transmission for URLLC Services

Authors: Liu Cao, Abbas Kiani, Amanda Xiang, Kaippallimalil John, Tony Saboorian

Abstract: 5th Generation Mobile Communication Technology (5G) utilizes the Access Traffic Steering, Switching, and Splitting (ATSSS) rule to enable multi-path data transmission, which is currently being standardized. Recently, the 3rd Generation Partnership Project (3GPP) SA1 and SA2 have been working on the multi-path solution for possible improvement from different perspectives. However, the existing 3GPP multi-path solution has some limitations on ultra-reliable low-latency communication (URLLC) traffic in terms of reliability and latency requirements. In order to capture the potential gains of multi-path architecture in the context of URLLC services, this paper proposes a novel traffic splitting technique that can more efficiently exploit the benefit of multi-path architecture in reducing user equipment (UE) uplink (UL) end-to-end (E2E) latency. In particular, we formulate an optimization framework that minimizes the user's UL E2E latency via joint optimization of the ratio of traffic assigned to each path and the corresponding transmit power. The performance of the proposed scheme is evaluated via well-designed simulations.

Submitted 21 October, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

Comments: This work has been submitted to the IEEE for possible publication. 5 pages, 6 figures

arXiv:2210.01353 [pdf, other]

Pay Self-Attention to Audio-Visual Navigation

Authors: Yinfeng Yu, Lele Cao, Fuchun Sun, Xiaohong Liu, Liejun Wang

Abstract: Audio-visual embodied navigation, as a hot research topic, aims to train a robot to reach an audio target using egocentric visual (from the sensors mounted on the robot) and audio (emitted from the target) input. The audio-visual information fusion strategy is naturally important to the navigation performance, but the state-of-the-art methods still simply concatenate the visual and audio features, potentially ignoring the direct impact of context. Moreover, the existing approaches require either phase-wise training or additional aid (e.g. topology graph and sound semantics). To date, work that deals with the more challenging setup with moving target(s) is still rare. As a result, we propose an end-to-end framework FSAAVN (feature self-attention audio-visual navigation) to learn to chase a moving audio target using a context-aware audio-visual fusion strategy implemented as a self-attention module. Our thorough experiments validate the superior performance (both quantitatively and qualitatively) of FSAAVN in comparison with state-of-the-art methods, and also provide unique insights about the choice of visual modalities, visual/audio encoder backbones and fusion patterns.

Submitted 5 October, 2022; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: Main paper (10 pages and 7 figures) and appendix (21 figures and 4 tables). Accepted for publication by BMVC 2022. For data and code, see https://yyf17.github.io/FSAAVN/index.html

arXiv:2209.02944 [pdf, other]

Architecture-Algorithmic Trade-offs in Multi-path Channel Estimation for mmWAVE Systems

Authors: Lyutianyang Zhang, Sumit Roy, Liu Cao

Abstract: 5G mmWave massive MIMO systems are likely to be deployed in dense urban scenarios, where increasing network capacity is the primary objective. A key component in mmWave transceiver design is channel estimation, which is challenging due to the very large signal bandwidths (order of GHz) implying significant resolved spatial multipath, coupled with a large number of Tx/Rx antennas for large-scale MIMO. This results in significantly increased training overhead that in turn leads to unacceptably high computational complexity and power cost. Our work thus highlights the interplay of transceiver architecture and receiver signal processing algorithm choices that fundamentally address (mobile) handset power consumption, with minimal degradation in performance. We investigate trade-offs enabled by the conjunction of a hybrid beamforming mmWave receiver and channel estimation algorithms that exploit available sparsity in such wideband scenarios. A compressive sensing (CS) framework for sparse channel estimation -- Binary Iterative Hard Thresholding (BIHT) \cite{jacques2013robust} followed by a linear reconstruction method with varying quantization (ADC) levels -- is explored to compare the trade-offs between bit-depth and sampling rate for a given ADC power budget. Performance analysis of the BIHT + linear reconstruction method is conducted via simulation studies for 5G specified multi-path channel models and compared to oracle-assisted bounds for validation.

Submitted 7 September, 2022; originally announced September 2022.

arXiv:2206.12046 [pdf, other]

Bilateral Network with Channel Splitting Network and Transformer for Thermal Image Super-Resolution

Authors: Bo Yan, Leilei Cao, Fengliang Qi, Hongbin Wang

Abstract: In recent years, the Thermal Image Super-Resolution (TISR) problem has become an attractive research topic. TISR can be used in a wide range of fields, including military, medical, agricultural and animal ecology. Due to the success of the PBVS-2020 and PBVS-2021 workshop challenges, the results of TISR keep improving and attract more researchers to sign up for the PBVS-2022 challenge. In this paper, we introduce the technical details of our submission to the PBVS-2022 challenge, designing a Bilateral Network with Channel Splitting Network and Transformer (BN-CSNT) to tackle the TISR problem. Firstly, we designed a context branch based on a channel splitting network with transformer to obtain sufficient context information. Secondly, we designed a spatial branch with a shallow transformer to extract low-level features which can preserve the spatial information. Finally, for the context branch, in order to fuse the features from the channel splitting network and transformer, we proposed an attention refinement module, and then features from the context branch and spatial branch are fused by the proposed feature fusion module. The proposed method can achieve PSNR=33.64, SSIM=0.9263 for x4 and PSNR=21.08, SSIM=0.7803 for x2 in the PBVS-2022 challenge test dataset.

Submitted 23 June, 2022; originally announced June 2022.

Comments: The second place solution for CVPR2022 PBVS-TISR challenge

arXiv:2206.11378 [pdf, other]

doi 10.1109/JSYST.2022.3183199

Multi-Access Point Coordination for Next-Gen Wi-Fi Networks Aided by Deep Reinforcement Learning

Authors: Lyutianyang Zhang, Hao Yin, Sumit Roy, Liu Cao

Abstract: Wi-Fi in the enterprise - characterized by overlapping Wi-Fi cells - constitutes the design challenge for next-generation networks. Standardization for recently started IEEE 802.11be (Wi-Fi 7) Working Groups has focused on significant medium access control layer changes that emphasize the role of the access point (AP) in radio resource management (RRM) for coordinating channel access due to the high collision probability with the distributed coordination function (DCF), especially in dense overlapping Wi-Fi networks. This paper proposes a novel multi-AP coordination system architecture aided by a centralized AP controller (APC). Meanwhile, a deep reinforcement learning channel access (DLCA) protocol is developed to replace the binary exponential backoff mechanism in DCF to enhance the network throughput by enabling the coordination of APs. First-Order Model-Agnostic Meta-Learning further enhances the network throughput. Subsequently, we also put forward a new greedy algorithm to maintain proportional fairness (PF) among multiple APs. Via simulation, the performance of the DLCA protocol in dense overlapping Wi-Fi networks is verified to have strong stability and outperform baselines such as Shared Transmission Opportunity (SH-TXOP) and Request-to-Send/Clear-to-Send (RTS/CTS) in terms of the network throughput by 10% and 3% as well as the network utility considering proportional fairness by 28.3% and 13.8%, respectively.

Submitted 22 June, 2022; originally announced June 2022.

Comments: To appear in IEEE Systems Journal. 12 pages, 13 figures

arXiv:2205.10897 [pdf, other]

Efficient PHY Layer Abstraction under Imperfect Channel Estimation

Authors: Liu Cao, Lyutianyang Zhang, Sian Jin, Sumit Roy

Abstract: Since most existing work investigates the PHY layer abstraction under an assumption of perfect channel estimation, it may become unreliable if there exists channel estimation error in a real communication system. This letter improves an efficient PHY layer method, EESM-log-SGN PHY layer abstraction, by considering the presence of channel estimation error. We develop two methods for implementing the EESM-log-SGN PHY abstraction under imperfect channel estimation. We show that the effective SINR is not impacted by the channel estimation error under multiple-input and single-output (MISO)/single-input and single-output (SISO) configuration, which is also verified by the full PHY simulation. The developed methods are then validated under different orthogonal frequency division multiplexing (OFDM) scenarios.

Submitted 8 October, 2022; v1 submitted 22 May, 2022; originally announced May 2022.

Comments: Submitted to IEEE Wireless Communications Letters. 5 pages, 7 figures

arXiv:2204.12736 [pdf]

doi 10.1007/978-3-031-20868-3_25

A Multi-Head Convolutional Neural Network With Multi-path Attention improves Image Denoising

Authors: Jiahong Zhang, Meijun Qu, Ye Wang, Lihong Cao

Abstract: Recently, convolutional neural networks (CNNs) and attention mechanisms have been widely used in image denoising and achieved satisfactory performance. However, previous works mostly use a single head to receive the noisy image, limiting the richness of extracted features. Therefore, a novel CNN with multiple heads (MH) named MHCNN is proposed in this paper, whose heads will receive the input images rotated by different rotation angles. MH makes MHCNN simultaneously utilize features of rotated images to remove noise. To integrate these features effectively, we present a novel multi-path attention mechanism (MPA). Unlike previous attention mechanisms that handle pixel-level, channel-level, or patch-level features, MPA focuses on features at the image level. Experiments show MHCNN surpasses other state-of-the-art CNN models on additive white Gaussian noise (AWGN) denoising and real-world image denoising. Its peak signal-to-noise ratio (PSNR) results are higher than those of other networks, such as BRDNet, RIDNet, PAN-Net, and CSANN. The code is accessible at https://github.com/JiaHongZ/MHCNN.

Submitted 3 November, 2022; v1 submitted 27 April, 2022; originally announced April 2022.

arXiv:2204.06746 [pdf, other]

Information fusion approach for biomass estimation in a plateau mountainous forest using a synergistic system comprising UAS-based digital camera and LiDAR

Authors: Rong Huang, Wei Yao, Zhong Xu, Lin Cao, Xin Shen

Abstract: Forest land plays a vital role in global climate, ecosystems, farming and human living environments. Therefore, forest biomass estimation methods are necessary to monitor changes in the forest structure and function, which are key data in natural resources research. Although accurate forest biomass measurements are important in forest inventory and assessments, high-density measurements that involve airborne light detection and ranging (LiDAR) at a low flight height in large mountainous areas are highly expensive. The objective of this study was to quantify the aboveground biomass (AGB) of a plateau mountainous forest reserve using a system that synergistically combines an unmanned aircraft system (UAS)-based digital aerial camera and LiDAR to leverage their complementary advantages. In this study, we utilized digital aerial photogrammetry (DAP), which has the unique advantages of speed, high spatial resolution, and low cost, to compensate for the deficiency of forestry inventory using UAS-based LiDAR that requires terrain-following flight for high-resolution data acquisition. Combined with the sparse LiDAR points acquired by using a high-altitude and high-speed UAS for terrain extraction, dense normalized DAP point clouds can be obtained to produce an accurate and high-resolution canopy height model (CHM). Based on the CHM and spectral attributes obtained from multispectral images, we estimated and mapped the AGB of the region of interest with considerable cost efficiency.
Our study supports the development of predictive models for large-scale wall-to-wall AGB mapping by leveraging the complementarity between DAP and LiDAR measurements. This work also reveals the potential of utilizing a UAS-based digital camera and LiDAR synergistically in a plateau mountainous forest area.

Submitted 14 April, 2022; originally announced April 2022.

arXiv:2203.02507 [pdf]

Parallel Fourier Ptychography reconstruction

Authors: Guocheng Zhou, Shaohui Zhang, Yao Hu, Lei Cao, Yong Huang, Qun Hao

Abstract: Fourier ptychography has attracted wide attention for its large space-bandwidth product and quantitative phase measurement. It is a typical computational imaging technique, which refers to optimizing both the imaging hardware and reconstruction algorithms simultaneously. The data redundancy and inverse problem algorithms are the sources of FPM's excellent performance. But at the same time, this large amount of data processing and complex algorithms also greatly reduce the imaging speed. In this article, we propose a parallel Fourier ptychography reconstruction framework consisting of three levels of parallel computing parts and implement it with both central processing unit (CPU) and compute unified device architecture (CUDA) platforms. In the conventional FPM reconstruction framework, the sample image is divided into multiple sub-regions for separate processing because the illumination angles for different sub-regions vary for the same LED and different sub-regions contain different defocus distances due to the non-planar distribution or non-ideal posture of the biological sample. We first build a parallel computing sub-framework in the spatial domain based on the above-mentioned characteristics. Then, by utilizing the sequential characteristics of updating different spectrum regions, a parallel computing sub-framework in the spectrum domain is carried out in our scheme. The feasibility of the proposed parallel FPM reconstruction framework is verified with different experimental results acquired with the system we built.

Submitted 3 March, 2022; originally announced March 2022.

Comments: 12 pages with 11 figures

arXiv:2203.00008 [pdf]

Learned end-to-end high-resolution lensless fiber imaging toward intraoperative real-time cancer diagnosis

Authors: Jiachen Wu, Tijue Wang, Ortrud Uckermann, Roberta Galli, Gabriele Schackert, Liangcai Cao, Jürgen Czarske, Robert Kuschmierz

Abstract: Endomicroscopy is indispensable for minimally invasive diagnostics in clinical practice. For optical keyhole monitoring of surgical interventions, high-resolution fiber endoscopic imaging is considered to be very promising, especially in combination with label-free imaging techniques to realize in vivo diagnosis. However, the inherent honeycomb artifacts of coherent fiber bundles (CFB) reduce the resolution and limit the clinical applications. We propose an end-to-end lensless fiber imaging scheme toward intraoperative real-time cancer diagnosis. The framework includes resolution enhancement and classification networks that use single-shot fiber bundle images to provide both high-resolution images and a tumor diagnosis result. The well-trained resolution enhancement network not only recovers high-resolution features beyond the physical limitations of CFB, but also helps improve the tumor recognition rate. Especially for glioblastoma, the resolution enhancement network helps increase the classification accuracy from 90.8% to 95.6%. The novel technique can enable histological real-time imaging through lensless fiber endoscopy and is promising for rapid and minimally invasive intraoperative diagnosis in clinics.

Submitted 28 February, 2022; originally announced March 2022.

arXiv:2202.10239 [pdf, other]

Fourier ptychography multi-parameter neural network with composite physical priori optimization

Authors: Delong Yang, Shaohui Zhang, Chuanjian Zheng, Guocheng Zhou, Lei Cao, Yao Hu, Qun Hao

Abstract: Fourier ptychography microscopy (FPM) is a recently developed computational imaging approach for microscopic super-resolution imaging. By sequentially turning on each light-emitting diode (LED) located at a different position on the LED array and acquiring the corresponding images that contain different spatial frequency components, high spatial resolution and quantitative phase imaging can be achieved in the case of a large field-of-view. Nevertheless, FPM has high requirements for the system construction and data acquisition processes, such as precise LED positions, accurate focusing and appropriate exposure time, which bring many limitations to its practical applications. In this paper, inspired by artificial neural networks, we propose a Fourier ptychography multi-parameter neural network (FPMN) with composite physical prior optimization. A hybrid parameter determination strategy combining a physical imaging model and data-driven network training is proposed to recover the multiple layers of the network corresponding to different physical parameters, including sample complex function, system pupil function, defocus distance, LED array position deviation and illumination intensity fluctuation, etc. Among these parameters, LED array position deviation is recovered based on the features of brightfield to darkfield transition low-resolution images while the others are recovered in the process of training the neural network. The feasibility and effectiveness of FPMN are verified through simulations and actual experiments.
Therefore, FPMN can evidently reduce the requirements for practical applications of FPM.

Submitted 17 February, 2022; originally announced February 2022.

Comments: 13 pages, 12 figures, solving inverse problem of computational imaging by neural network

arXiv:2112.12055 [pdf]

doi 10.1038/s41377-022-00898-2

Quantitative phase imaging through an ultra-thin lensless fiber endoscope

Authors: Jiawei Sun, Jiachen Wu, Song Wu, Liangcai Cao, Ruchi Goswami, Salvatore Girardo, Jochen Guck, Nektarios Koukourakis, Juergen W. Czarske

Abstract: Quantitative phase imaging (QPI) is a label-free technique providing both morphology and quantitative biophysical information in biomedicine. However, applying such a powerful technique to in vivo pathological diagnosis remains challenging. Multi-core fiber bundles (MCFs) enable ultra-thin probes for in vivo imaging, but current MCF imaging techniques are limited to amplitude imaging modalities. We demonstrate a computational lensless microendoscope that uses an ultra-thin bare MCF to perform quantitative phase imaging of biomedical samples with up to 1 μm lateral resolution and nanoscale axial resolution. The incident complex light field at the measurement side is precisely reconstructed from a single-shot far-field speckle pattern at the detection side, enabling digital focusing and 3D volumetric reconstruction without any mechanical movement. The accuracy of the quantitative phase reconstruction is validated by imaging the phase target and hydrogel beads through the MCF. With the proposed imaging modality, 3D imaging of human cancer cells is achieved through the ultra-thin fiber endoscope, promising widespread clinical applications.

Submitted 6 July, 2022; v1 submitted 22 December, 2021; originally announced December 2021.

Comments: 16 pages, 6 figures

arXiv:2111.12758 [pdf]

Lensless multicore-fiber microendoscope for real-time tailored light field generation with phase encoder neural network (CoreNet)

Authors: Jiawei Sun, Jiachen Wu, Nektarios Koukourakis, Robert Kuschmierz, Liangcai Cao, Juergen Czarske

Abstract: The generation of tailored light with multi-core fiber (MCF) lensless microendoscopes is widely used in biomedicine. However, the computer-generated holograms (CGHs) used for such applications are typically generated by iterative algorithms, which demand high computation effort, limiting advanced applications like in vivo optogenetic stimulation and fiber-optic cell manipulation. The random and discrete distribution of the fiber cores induces strong spatial aliasing to the CGHs, hence, an approach that can rapidly generate tailored CGHs for MCFs is highly demanded. We demonstrate a novel phase encoder deep neural network (CoreNet), which can generate accurate tailored CGHs for MCFs at a near video-rate. Simulations show that CoreNet can speed up the computation time by two magnitudes and increase the fidelity of the generated light field compared to the conventional CGH techniques. For the first time, real-time generated tailored CGHs are on-the-fly loaded to the phase-only SLM for dynamic light fields generation through the MCF microendoscope in experiments. This paves the avenue for real-time cell rotation and several further applications that require real-time high-fidelity light delivery in biomedicine.

Submitted 24 November, 2021; originally announced November 2021.

arXiv:2110.03841 [pdf, ps, other]

Input Length Matters: Improving RNN-T and MWER Training for Long-form Telephony Speech Recognition

Authors: Zhiyun Lu, Yanwei Pan, Thibault Doutre, Parisa Haghani, Liangliang Cao, Rohit Prabhavalkar, Chao Zhang, Trevor Strohman

Abstract: End-to-end models have achieved state-of-the-art results on several automatic speech recognition tasks. However, they perform poorly when evaluated on long-form data, e.g., minutes long conversational telephony audio. One reason the model fails on long-form speech is that it has only seen short utterances during training. In this paper we study the effect of training utterance length on the word error rate (WER) for RNN-transducer (RNN-T) model. We compare two widely used training objectives, log loss (or RNN-T loss) and minimum word error rate (MWER) loss. We conduct experiments on telephony datasets in four languages. Our experiments show that for both losses, the WER on long-form speech reduces substantially as the training utterance length increases. The average relative WER gain is 15.7% for log loss and 8.8% for MWER loss. When training on short utterances, MWER loss leads to a lower WER than the log loss. Such difference between the two losses diminishes when the input length increases.

Submitted 1 April, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: submitted to INTERSPEECH 2022

arXiv:2110.03327 [pdf, other]

Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition

Authors: Qiujia Li, Yu Zhang, David Qiu, Yanzhang He, Liangliang Cao, Philip C. Woodland

Abstract: As end-to-end automatic speech recognition (ASR) models reach promising performance, various downstream tasks rely on good confidence estimators for these systems. Recent research has shown that model-based confidence estimators have a significant advantage over using the output softmax probabilities. If the input data to the speech recogniser is from mismatched acoustic and linguistic conditions, the ASR performance and the corresponding confidence estimators may exhibit severe degradation. Since confidence models are often trained on the same in-domain data as the ASR, generalising to out-of-domain (OOD) scenarios is challenging. By keeping the ASR model untouched, this paper proposes two approaches to improve the model-based confidence estimators on OOD data: using pseudo transcriptions and an additional OOD language model. With an ASR model trained on LibriSpeech, experiments show that the proposed methods can greatly improve the confidence metrics on TED-LIUM and Switchboard datasets while preserving in-domain performance. Furthermore, the improved confidence estimators are better calibrated on OOD data and can provide a much more reliable criterion for data selection.

Submitted 2 March, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: Accepted as a conference paper at ICASSP 2022

arXiv:2109.13226 [pdf, other]

doi 10.1109/JSTSP.2022.3182537

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, et al. (1 additional author not shown)

Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitudes of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.

Submitted 21 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

Comments: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated

arXiv:2109.05496 [pdf]

A Complex Constrained Total Variation Image Denoising Algorithm with Application to Phase Retrieval

Authors: Yunhui Gao, Liangcai Cao

Abstract: This paper considers the constrained total variation (TV) denoising problem for complex-valued images. We extend the definition of TV seminorms for real-valued images to dealing with complex-valued ones. In particular, we introduce two types of complex TV in both isotropic and anisotropic forms. To solve the constrained denoising problem, we adopt a dual approach and derive an accelerated gradient projection algorithm. We further generalize the proposed denoising algorithm as a key building block of the proximal gradient scheme to solve a vast class of complex constrained optimization problems with TV regularizers. As an example, we apply the proposed algorithmic framework to phase retrieval. We combine the complex TV regularizer with the conventional projection-based method within the constraint complex TV model. Initial results from both simulated and optical experiments demonstrate the validity of the constrained TV model in extracting sparsity priors within complex-valued images, while also utilizing physically tractable constraints that help speed up convergence.

Submitted 12 September, 2021; originally announced September 2021.

Comments: 11 pages, 7 figures

arXiv:2104.14346 [pdf, other]

Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models

Authors: Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao

Abstract: Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real-time. Their minimal latency makes them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context and suffer from higher word error rates (WER). To improve streaming models, a recent study [1] proposed to distill a non-streaming teacher model on unsupervised utterances, and then train a streaming student using the teachers' predictions. However, the performance gap between teacher and student WERs remains high. In this paper, we aim to close this gap by using a diversified set of non-streaming teacher models and combining them using Recognizer Output Voting Error Reduction (ROVER). In particular, we show that, despite being weaker than RNN-T models, CTC models are remarkable teachers. Further, by fusing RNN-T and CTC models together, we build the strongest teachers. The resulting student models drastically improve upon streaming models of previous work [1]: the WER decreases by 41% on Spanish, 27% on Portuguese, and 13% on French.

Submitted 25 April, 2021; originally announced April 2021.

arXiv:2104.12870 [pdf, other]

Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction

Authors: David Qiu, Yanzhang He, Qiujia Li, Yu Zhang, Liangliang Cao, Ian McGraw

Abstract: Confidence scores are very useful for downstream applications of automatic speech recognition (ASR) systems. Recent works have proposed using neural networks to learn word or utterance confidence scores for end-to-end ASR. In those studies, word confidence by itself does not model deletions, and utterance confidence does not take advantage of word-level training signals. This paper proposes to jointly learn word confidence, word deletion, and utterance confidence. Empirical results show that multi-task learning with all three objectives improves confidence metrics (NCE, AUC, RMSE) without the need for increasing the model size of the confidence estimation module. Using the utterance-level confidence for rescoring also decreases the word error rates on Google's Voice Search and Long-tail Maps datasets by 3-5% relative, without needing a dedicated neural rescorer.

Submitted 26 April, 2021; originally announced April 2021.

Comments: Submitted to Interspeech 2021

arXiv:2104.02757 [pdf, other]

Exploring Targeted Universal Adversarial Perturbations to End-to-end ASR Models

Authors: Zhiyun Lu, Wei Han, Yu Zhang, Liangliang Cao

Abstract: Although end-to-end automatic speech recognition (e2e ASR) models are widely deployed in many applications, there have been very few studies to understand models' robustness against adversarial perturbations. In this paper, we explore whether a targeted universal perturbation vector exists for e2e ASR models. Our goal is to find perturbations that can mislead the models to predict the given targeted transcript such as "thank you" or empty string on any input utterance. We study two different attacks, namely additive and prepending perturbations, and their performances on the state-of-the-art LAS, CTC and RNN-T models. We find that LAS is the most vulnerable to perturbations among the three models. RNN-T is more robust against additive perturbations, especially on long utterances. And CTC is robust against both additive and prepending perturbations. To attack RNN-T, we find prepending perturbation is more effective than the additive perturbation, and can mislead the models to predict the same short target on utterances of arbitrary length.

Submitted 6 April, 2021; originally announced April 2021.

Comments: Submitted to INTERSPEECH 2021

arXiv:2103.14152 [pdf, other]

Residual Energy-Based Models for End-to-End Speech Recognition

Authors: Qiujia Li, Yu Zhang, Bo Li, Liangliang Cao, Philip C. Woodland

Abstract: End-to-end models with auto-regressive decoders have shown impressive results for automatic speech recognition (ASR). These models formulate the sequence-level probability as a product of the conditional probabilities of all individual tokens given their histories. However, the performance of locally normalised models can be sub-optimal because of factors such as exposure bias. Consequently, the model distribution differs from the underlying data distribution. In this paper, the residual energy-based model (R-EBM) is proposed to complement the auto-regressive ASR model to close the gap between the two distributions. Meanwhile, R-EBMs can also be regarded as utterance-level confidence estimators, which may benefit many downstream tasks. Experiments on a 100hr LibriSpeech dataset show that R-EBMs can reduce the word error rates (WERs) by 8.2%/6.7% while improving areas under precision-recall curves of confidence scores by 12.6%/28.4% on test-clean/test-other sets. Furthermore, on a state-of-the-art model using self-supervised learning (wav2vec 2.0), R-EBMs still significantly improves both the WER and confidence estimation performance.

Submitted 23 June, 2021; v1 submitted 25 March, 2021; originally announced March 2021.

Comments: To appear in Proc. Interspeech 2021

arXiv:2103.06716 [pdf, other]

Learning Word-Level Confidence For Subword End-to-End ASR

Authors: David Qiu, Qiujia Li, Yanzhang He, Yu Zhang, Bo Li, Liangliang Cao, Rohit Prabhavalkar, Deepti Bhatia, Wei Li, Ke Hu, Tara N. Sainath, Ian McGraw

Abstract: We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR). Although prior works have proposed training auxiliary confidence models for ASR systems, they do not extend naturally to systems that operate on word-pieces (WP) as their vocabulary. In particular, ground truth WP correctness labels are needed for training confidence models, but the non-unique tokenization from word to WP causes inaccurate labels to be generated. This paper proposes and studies two confidence models of increasing complexity to solve this problem. The final model uses self-attention to directly learn word-level confidence without needing subword tokenization, and exploits full context features from multiple hypotheses to improve confidence accuracy. Experiments on Voice Search and long-tail test sets show standard metrics (e.g., NCE, AUC, RMSE) improving substantially. The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model.

Submitted 11 March, 2021; originally announced March 2021.

Comments: To appear in ICASSP 2021

arXiv:2103.00534 [pdf, other]

doi 10.1109/TCOMM.2021.3116151

RIS-Aided Wireless Communications: Prototyping, Adaptive Beamforming, and Indoor/Outdoor Field Trials

Authors: Xilong Pei, Haifan Yin, Li Tan, Lin Cao, Zhanpeng Li, Kai Wang, Kun Zhang, Emil Björnson

Abstract: The prospects of using a Reconfigurable Intelligent Surface (RIS) to aid wireless communication systems have recently received much attention from academia and industry. Most papers make theoretical studies based on elementary models, while the prototyping of RIS-aided wireless communication and real-world field trials are scarce. In this paper, we describe a new RIS prototype consisting of 1100 controllable elements working at 5.8 GHz band. We propose an efficient algorithm for configuring the RIS over the air by exploiting the geometrical array properties and a practical receiver-RIS feedback link. In our indoor test, where the transmitter and receiver are separated by a 30 cm thick concrete wall, our RIS prototype provides a 26 dB power gain compared to the baseline case where the RIS is replaced by a copper plate. A 27 dB power gain was observed in the short-distance outdoor measurement. We also carried out long-distance measurements and successfully transmitted a 32 Mbps data stream over 500 m. A 1080p video was live-streamed and it only played smoothly when the RIS was utilized. The power consumption of the RIS is around 1 W. Our paper is vivid proof that the RIS is a very promising technology for future wireless communications.

Submitted 31 July, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

Comments: 13 pages, 18 figures, submitted

arXiv:2012.02381 [pdf, other]

Generator Pyramid for High-Resolution Image Inpainting

Authors: Leilei Cao, Tong Yang, Yixu Wang, Bo Yan, Yandong Guo

Abstract: Inpainting high-resolution images with large holes challenges existing deep learning based image inpainting methods. We present a novel framework -- PyramidFill for high-resolution image inpainting task, which explicitly disentangles content completion and texture synthesis. PyramidFill attempts to complete the content of unknown regions in a lower-resolution image, and synthesis the textures of unknown regions in a higher-resolution image, progressively. Thus, our model consists of a pyramid of fully convolutional GANs, wherein the content GAN is responsible for completing contents in the lowest-resolution masked image, and each texture GAN is responsible for synthesizing textures in a higher-resolution image. Since completing contents and synthesising textures demand different abilities from generators, we customize different architectures for the content GAN and texture GAN. Experiments on multiple datasets including CelebA-HQ, Places2 and a new natural scenery dataset (NSHQ) with different resolutions demonstrate that PyramidFill generates higher-quality inpainting results than the state-of-the-art methods. To better assess high-resolution image inpainting methods, we will release NSHQ, high-quality natural scenery images with high-resolution 1920$\times$1080.

Submitted 3 December, 2020; originally announced December 2020.

Comments: Under review

arXiv:2010.12096 [pdf, other]

Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data

Authors: Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao

Abstract: Streaming end-to-end automatic speech recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, compared to their non-streaming counterparts. Consequently, streaming models usually perform worse than non-streaming models. We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher to generate transcripts on an arbitrarily large data set, which is then used to distill knowledge into streaming ASR models. This way, we scale the training of streaming models to up to 3 million hours of YouTube audio. Experiments show that our approach can significantly reduce the word error rate (WER) of RNNT models not only on LibriSpeech but also on YouTube data in four languages. For example, in French, we are able to reduce the WER by 16.4% relatively to a baseline streaming model by leveraging a non-streaming teacher model trained on the same amount of labeled data as the baseline.

Submitted 21 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

arXiv:2010.11428 [pdf, other]

Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition

Authors: Qiujia Li, David Qiu, Yu Zhang, Bo Li, Yanzhang He, Philip C. Woodland, Liangliang Cao, Trevor Strohman

Abstract: For various speech-related tasks, confidence scores from a speech recogniser are a useful measure to assess the quality of transcriptions. In traditional hidden Markov model-based automatic speech recognition (ASR) systems, confidence scores can be reliably obtained from word posteriors in decoding lattices. However, for an ASR system with an auto-regressive decoder, such as an attention-based sequence-to-sequence model, computing word posteriors is difficult. An obvious alternative is to use the decoder softmax probability as the model confidence. In this paper, we first examine how some commonly used regularisation methods influence the softmax-based confidence scores and study the overconfident behaviour of end-to-end models. Then we propose a lightweight and effective approach named confidence estimation module (CEM) on top of an existing end-to-end ASR model. Experiments on LibriSpeech show that CEM can mitigate the overconfidence problem and can produce more reliable confidence scores with and without shallow fusion of a language model. Further analysis shows that CEM generalises well to speech from a moderately mismatched domain and can potentially improve downstream tasks such as semi-supervised learning.

Submitted 23 October, 2020; v1 submitted 22 October, 2020; originally announced October 2020.

Comments: Submitted to ICASSP 2021

arXiv:2007.05927 [pdf, other]

A Three-limb Teleoperated Robotic System with Foot Control for Flexible Endoscopic Surgery

Authors: Yanpei Huang, Wenjie Lai, Lin Cao, Jiajun Liu, Xiaoguo Li, Etienne Burdet, Soo Jay Phee

Abstract: Flexible endoscopy requires high skills to manipulate both the endoscope and associated instruments. In most robotic flexible endoscopic systems, the endoscope and instruments are controlled separately by two operators, which may result in communication errors and inefficient operation. We present a novel teleoperation robotic endoscopic system that can be commanded by a surgeon alone. This 13 degrees-of-freedom (DoF) system integrates a foot-controlled robotic flexible endoscope and two hand-controlled robotic endoscopic instruments (a robotic grasper and a robotic cauterizing hook). A foot-controlled human-machine interface maps the natural foot gestures to the 4-DoF movements of the endoscope, and two hand-controlled interfaces map the movements of the two hands to the two instruments individually. The proposed robotic system was validated in an ex-vivo experiment carried out by six subjects, where foot control was also compared with a sequential clutch-based hand control scheme. The participants could successfully teleoperate the endoscope and the two instruments to cut the tissues at scattered target areas in a porcine stomach. Foot control yielded 43.7% faster task completion and required less mental effort as compared to the clutch-based hand control scheme. The system introduced in this paper is intuitive for three-limb manipulation even for operators without experience of handling the endoscope and robotic instruments. This three-limb teleoperated robotic system enables one surgeon to intuitively control three endoscopic tools which normally require two operators, leading to reduced manpower, less communication errors, and improved efficiency.

Submitted 12 July, 2020; originally announced July 2020.

Comments: 9 pages, 11 figures

arXiv:2005.03271 [pdf, other]

RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions

Authors: Chung-Cheng Chiu, Arun Narayanan, Wei Han, Rohit Prabhavalkar, Yu Zhang, Navdeep Jaitly, Ruoming Pang, Tara N. Sainath, Patrick Nguyen, Liangliang Cao, Yonghui Wu

Abstract: In recent years, all-neural end-to-end approaches have obtained state-of-the-art results on several challenging automatic speech recognition (ASR) tasks. However, most existing works focus on building ASR models where train and test data are drawn from the same domain. This results in poor generalization characteristics on mismatched-domains: e.g., end-to-end models trained on short segments perform poorly when evaluated on longer utterances. In this work, we analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models in order to identify model components that negatively affect generalization performance. We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference. On a long-form YouTube test set, when the nonstreaming RNN-T model is trained with shorter segments of data, the proposed combination improves word error rate (WER) from 22.3% to 14.8%; when the streaming RNN-T model trained on short Search queries, the proposed techniques improve WER on the YouTube set from 67.0% to 25.3%. Finally, when trained on Librispeech, we find that dynamic overlapping inference improves WER on YouTube from 99.8% to 33.0%.

Submitted 23 December, 2020; v1 submitted 7 May, 2020; originally announced May 2020.

Comments: SLT camera-ready version

arXiv:1911.09762 [pdf, other]

Speech Sentiment Analysis via Pre-trained Features from End-to-end ASR Models

Authors: Zhiyun Lu, Liangliang Cao, Yu Zhang, Chung-Cheng Chiu, James Fan

Abstract: In this paper, we propose to use pre-trained features from end-to-end ASR models to solve speech sentiment analysis as a down-stream task. We show that end-to-end ASR features, which integrate both acoustic and text information from speech, achieve promising results. We use RNN with self-attention as the sentiment classifier, which also provides an easy visualization through attention weights to help interpret model predictions. We use well benchmarked IEMOCAP dataset and a new large-scale speech sentiment dataset SWBD-sentiment for evaluation. Our approach improves the-state-of-the-art accuracy on IEMOCAP from 66.6% to 71.7%, and achieves an accuracy of 70.10% on SWBD-sentiment with more than 49,500 utterances.

Submitted 4 March, 2020; v1 submitted 21 November, 2019; originally announced November 2019.

Showing 1–50 of 54 results for author: Cao, L