
GenDistiller: Distilling Pre-trained Language Models based on an Autoregressive Generative Model

Authors: Yingying Gao, Shilei Zhang, Chao Deng, Junlan Feng

Abstract: Pre-trained speech language models such as HuBERT and WavLM leverage unlabeled speech data for self-supervised learning and offer powerful representations for numerous downstream tasks. Despite the success of these models, their high memory and computing requirements hinder their application on resource-restricted devices. Therefore, this paper introduces GenDistiller, a novel knowledge distillation framework which generates the hidden representations of the pre-trained teacher model directly with a much smaller student network. The proposed method takes the previous hidden layer as history and predicts the teacher model's layers one by one, autoregressively. Experiments on SUPERB reveal the advantage of GenDistiller over the baseline distilling method without an autoregressive framework, with 33% fewer parameters, similar time consumption and better performance on most of the SUPERB tasks. Ultimately, the proposed GenDistiller reduces the size of WavLM by 82%.

Submitted 21 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2310.13418

arXiv:2406.07801 [pdf, other]

PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models

Authors: Runyan Yang, Huibao Yang, Xiqing Zhang, Tiantian Ye, Ying Liu, Yingying Gao, Shilei Zhang, Chao Deng, Junlan Feng

Abstract: Recently, there have been attempts to integrate various speech processing tasks into a unified model. However, few previous works have directly demonstrated that joint optimization of diverse tasks in multitask speech models has a positive influence on the performance of individual tasks. In this paper we present a multitask speech model, PolySpeech, which supports speech recognition, speech synthesis, and two speech classification tasks. PolySpeech takes a multi-modal language model as its core structure and uses semantic representations as speech inputs. We introduce semantic speech embedding tokenization and speech reconstruction methods to PolySpeech, enabling efficient generation of high-quality speech for any given speaker. PolySpeech shows competitiveness across various tasks compared to single-task models. In our experiments, multitask optimization achieves performance comparable to single-task optimization and is especially beneficial for specific tasks.

Submitted 11 June, 2024; originally announced June 2024.

Comments: 5 pages, 2 figures

arXiv:2405.10463 [pdf, other]

Single-shot volumetric fluorescence imaging with neural fields

Authors: Oumeng Zhang, Haowen Zhou, Brandon Y. Feng, Elin M. Larsson, Reinaldo E. Alcalde, Siyuan Yin, Catherine Deng, Changhuei Yang

Abstract: Single-shot volumetric fluorescence (SVF) imaging offers a significant advantage over traditional imaging methods that require scanning across multiple axial planes, as it can capture biological processes with high temporal resolution across a large field of view. The key challenges in SVF imaging include requiring sparsity constraints to meet the multiplexing requirements of compressed sensing, eliminating depth ambiguity in the reconstruction, and maintaining high resolution across a large field of view. In this paper, we introduce the QuadraPol point spread function (PSF) combined with neural fields, a novel approach for SVF imaging. This method utilizes a custom polarizer at the back focal plane and a polarization camera to detect fluorescence, effectively encoding the 3D scene within a compact PSF without depth ambiguity. Additionally, we propose a reconstruction algorithm based on the neural fields technique that provides improved reconstruction quality and addresses the inaccuracies of phase retrieval methods used to correct imaging system aberrations. This algorithm combines the accuracy of experimental PSFs with the long depth of field of computationally generated retrieved PSFs. QuadraPol PSF, combined with neural fields, reduces the acquisition time of a conventional fluorescence microscope by a factor of approximately 20 and captures a 100 mm$^3$ cubic volume in one shot.
We validate the effectiveness of both our hardware and algorithm through all-in-focus imaging of bacterial colonies on sand surfaces and visualization of plant root morphology. Our approach offers a powerful tool for advancing biological research and ecological studies.

Submitted 4 June, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

arXiv:2402.12746 [pdf, ps, other]

Plugin Speech Enhancement: A Universal Speech Enhancement Framework Inspired by Dynamic Neural Network

Authors: Yanan Chen, Zihao Cui, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang

Abstract: The expectation of deploying a universal neural network for speech enhancement, with the aim of improving noise robustness across diverse speech processing tasks, faces challenges because static speech enhancement frameworks lack awareness of the speech expected by downstream modules. These limitations impede the effectiveness of static speech enhancement approaches in achieving optimal performance for a range of speech processing tasks, thereby challenging the notion of universal applicability. The fundamental issue in achieving universal speech enhancement lies in effectively informing the speech enhancement module about the features of downstream modules. In this study, we present a novel weighting prediction approach, which explicitly learns the task relationships from downstream training information to address the core challenge of universal speech enhancement. We found that the decision whether to employ data augmentation techniques is crucial downstream training information. This decision significantly impacts the expected speech and the performance of the speech enhancement module. Moreover, we introduce a novel speech enhancement network, the Plugin Speech Enhancement (Plugin-SE). The Plugin-SE is a dynamic neural network that includes the speech enhancement module, gate module, and weight prediction module. Experimental results demonstrate that the proposed Plugin-SE approach is competitive with or superior to other joint training methods across various downstream tasks.

Submitted 20 February, 2024; originally announced February 2024.

arXiv:2401.14421 [pdf, other]

Multi-Agent Based Transfer Learning for Data-Driven Air Traffic Applications

Authors: Chuhao Deng, Hong-Cheol Choi, Hyunsang Park, Inseok Hwang

Abstract: Research in developing data-driven models for Air Traffic Management (ATM) has gained tremendous interest in recent years. However, data-driven models are known to have long training times and to require large datasets to achieve good performance. To address these two issues, this paper proposes a Multi-Agent Bidirectional Encoder Representations from Transformers (MA-BERT) model that fully considers the multi-agent characteristic of the ATM system and learns air traffic controllers' decisions, and a pre-training and fine-tuning transfer learning framework. By pre-training the MA-BERT on a large dataset from a major airport and then fine-tuning it to other airports and specific air traffic applications, a large amount of the total training time can be saved. In addition, for newly adopted procedures and constructed airports where no historical data is available, this paper shows that the pre-trained MA-BERT can achieve high performance by updating regularly with little data. The proposed transfer learning framework and MA-BERT are tested with the automatic dependent surveillance-broadcast data recorded at 3 airports in South Korea in 2019.

Submitted 23 January, 2024; originally announced January 2024.

Comments: 12 pages, 8 figures, submitted to IEEE Transactions on Intelligent Transportation Systems

arXiv:2311.04534 [pdf, other]

Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

Authors: Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Yukun Ma, Hai Yu, Jiaqing Liu, Chong Zhang

Abstract: Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Masking strategy for the ASR task, which ignores the dependency among speech tokens. In this paper, we propose to model speech tokens in an autoregressive way, similar to text. We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach. To address this issue, we propose a novel approach denoted Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels on speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for decoder-only Transformers in ASR tasks with different speech discretization methods. The source code can be found here: https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld

Submitted 4 February, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

Comments: 5 pages, accepted by ICASSP 2024

arXiv:2310.17664 [pdf, other]

Cascaded Multi-task Adaptive Learning Based on Neural Architecture Search

Authors: Yingying Gao, Shilei Zhang, Zihao Cui, Chao Deng, Junlan Feng

Abstract: Cascading multiple pre-trained models is an effective way to compose an end-to-end system. However, fine-tuning the full cascaded model is parameter and memory inefficient, and our observations reveal that applying only adapter modules to the cascaded model cannot achieve performance comparable to fine-tuning. We propose an automatic and effective adaptive learning method to optimize end-to-end cascaded multi-task models based on the Neural Architecture Search (NAS) framework. The candidate adaptive operations on each specific module consist of freezing, inserting an adapter, and fine-tuning. We further add a penalty term to the loss, which takes the number of trainable parameters into account, to limit the learned structure. The penalty term successfully restricts the searched architecture, and the proposed approach is able to find a tuning scheme similar to the hand-crafted one, compressing the optimized parameters to 8.7% of those in full fine-tuning on SLURP with even better performance.

Submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.13418 [pdf, other]

GenDistiller: Distilling Pre-trained Language Models based on Generative Models

Authors: Yingying Gao, Shilei Zhang, Zihao Cui, Yanhan Xu, Chao Deng, Junlan Feng

Abstract: Self-supervised pre-trained models such as HuBERT and WavLM leverage unlabeled speech data for representation learning and offer significant improvements for numerous downstream tasks. Despite the success of these methods, their large memory and strong computational requirements hinder their application on resource-restricted devices. Therefore, this paper introduces GenDistiller, a novel knowledge distillation framework to distill hidden representations from a teacher network based on a generative language model. The generative structure enables the proposed model to generate the target teacher hidden layers autoregressively, considering the interactions between hidden layers without introducing additional inputs. A two-dimensional attention mechanism is implemented to ensure the causality of hidden layers, while preserving bidirectional attention in the time dimension. Experiments reveal the advantage of the generative distiller over the baseline system that predicts the hidden layers of the teacher network directly without a generative model.

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2305.10821 [pdf, other]

Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation

Authors: Yanjie Fu, Meng Ge, Honglong Wang, Nan Li, Haoran Yin, Longbiao Wang, Gaoyan Zhang, Jianwu Dang, Chengyun Deng, Fei Wang

Abstract: Recently, stunning improvements in multi-channel speech separation have been achieved by neural beamformers when direction information is available. However, most of them neglect the speakers' 2-dimensional (2D) location cues contained in the mixture signal, which limits the performance when two sources come from close directions. In this paper, we propose an end-to-end beamforming network for 2D location guided speech separation given merely the mixture signal. It first estimates discriminable direction and 2D location cues, which indicate the directions the sources come from in multiple views of the microphones and their 2D coordinates. These cues are then integrated into a location-aware neural beamformer, thus allowing accurate reconstruction of the two sources' speech signals. Experiments show that our proposed model not only achieves a comprehensive, decent improvement compared to baseline systems, but also avoids inferior performance in spatially overlapping cases.

Submitted 2 June, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech 2023. arXiv admin note: substantial text overlap with arXiv:2212.03401

arXiv:2303.13932 [pdf, ps, other]

Overview of the ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG)

Authors: Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen, Wen Wang, Zhijie Yan, **glin Liu, Yi Ren, Zhou Zhao

Abstract: The ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG) focuses on promoting a wide range of spoken language processing (SLP) research on meeting transcripts, as SLP applications are critical to improving users' efficiency in grasping important information in meetings. MUG comprises five tracks: topic segmentation, topic-level and session-level extractive summarization, topic title generation, keyphrase extraction, and action item detection. To facilitate MUG, we construct and release a large-scale meeting dataset, the AliMeeting4MUG Corpus.

Submitted 24 March, 2023; originally announced March 2023.

Comments: Paper accepted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023), Rhodes, Greece

arXiv:2303.00952 [pdf, other]

Towards Activated Muscle Group Estimation in the Wild

Authors: Kunyu Peng, David Schneider, Alina Roitberg, Kailun Yang, Jiaming Zhang, Chen Deng, Kaiyu Zhang, M. Saquib Sarfraz, Rainer Stiefelhagen

Abstract: In this paper, we tackle the new task of video-based Activated Muscle Group Estimation (AMGE), which aims at identifying active muscle regions during physical activity in the wild. To this end, we provide the MuscleMap dataset featuring >15K video clips with 135 different activities and 20 labeled muscle groups. This dataset opens up vistas for multiple video-based applications in sports and rehabilitation medicine under flexible environment constraints. The proposed MuscleMap dataset is constructed with YouTube videos, specifically targeting High-Intensity Interval Training (HIIT) physical exercise in the wild. To make the AMGE model applicable in real-life situations, it is crucial to ensure that the model can generalize well to numerous types of physical activities not present during training and involving new combinations of activated muscles. To achieve this, our benchmark also covers an evaluation setting where the model is exposed to activity types excluded from the training set. Our experiments reveal that the generalizability of existing architectures adapted for the AMGE task remains a challenge. Therefore, we also propose a new approach, TransM3E, which employs a multi-modality feature fusion mechanism between the video transformer model and the skeleton-based graph convolution model, with novel cross-modal knowledge distillation executed on multi-classification tokens. The proposed method surpasses all popular video classification models when dealing with both previously seen and new types of physical activities.
The contributed dataset and code will be publicly available at https://github.com/KPeng9510/MuscleMap.

Submitted 27 April, 2024; v1 submitted 1 March, 2023; originally announced March 2023.

Comments: The contributed dataset and code will be publicly available at https://github.com/KPeng9510/MuscleMap

arXiv:2212.03401 [pdf, other]

MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware Beamforming Network for Speech Separation

Authors: Yanjie Fu, Haoran Yin, Meng Ge, Longbiao Wang, Gaoyan Zhang, Jianwu Dang, Chengyun Deng, Fei Wang

Abstract: Recently, many deep learning based beamformers have been proposed for multi-channel speech separation. Nevertheless, most of them rely on extra cues known in advance, such as speaker features, face images or directional information. In this paper, we propose an end-to-end beamforming network for direction guided speech separation given merely the mixture signal, namely MIMO-DBnet. Specifically, we design a multi-channel input and multiple outputs architecture to predict the direction-of-arrival based embeddings and beamforming weights for each source. The precisely estimated directional embedding provides effective spatial discrimination guidance for the neural beamformer to offset the effect of phase wrapping, thus allowing more accurate reconstruction of the two sources' speech signals. Experiments show that our proposed MIMO-DBnet not only achieves a comprehensive, decent improvement compared to baseline systems, but also maintains the performance on high frequency bands when phase wrapping occurs.

Submitted 6 December, 2022; originally announced December 2022.

Comments: Submitted to ICASSP 2023

arXiv:2211.00206 [pdf]

A Primary Frequency Control Strategy for Variable-Speed Pumped-Storage Plant in Power Generation Based on Adaptive Model Predictive Control

Authors: Zhenghua Xu, Changhong Deng, Qiuling Yang

Abstract: Variable-speed pumped-storage (VSPS) has great potential in helping solve the frequency control problem caused by low inertia, owing to its remarkable flexibility beyond that of conventional fixed-speed units. To make better use of this flexibility, a primary frequency control strategy based on adaptive model predictive control (AMPC) is proposed in this paper for VSPS plants in power generation.

Submitted 31 October, 2022; originally announced November 2022.

Comments: 8 pages, 9 figures

arXiv:2210.09531

The Brain-Inspired Cooperative Shared Control for Brain-Machine Interface

Authors: Shengjie Zheng, Ling Liu, Junjie Yang, Lang Qian, Gang Gao, Xin Chen, Wenqi **, Chunshan Deng, Xiaojian Li

Abstract: In the practical application of brain-machine interface technology, a problem often faced is the low information content and high noise of the neural signals collected by the electrode, together with the difficulty of decoding them, which makes it difficult for the robot to obtain stable instructions to complete the task. Cooperative shared control addresses this by extracting general motor commands from brain activity, while the fine details of the movement can be delegated to the robot for completion, or the brain can have complete control. This study proposes a brain-machine interface shared control system based on spiking neural networks for two settings: robotic arm movement control, and wheel speed control and steering of wheeled robots. The former can reliably control the robotic arm to move to the destination position, while the latter controls the wheeled robots for object tracking and map generation. The results show that shared control based on brain-inspired intelligence can perform some typical tasks in complex environments and improve the fluency and ease of use of brain-machine interaction, and they also demonstrate the potential of this control method in clinical applications of brain-machine interfaces.

Submitted 25 June, 2024; v1 submitted 17 October, 2022; originally announced October 2022.

Comments: This article needs to be updated with the corrected figure and data

arXiv:2210.01434 [pdf, ps, other]

Beamforming Design and Trajectory Optimization for UAV-Empowered Adaptable Integrated Sensing and Communication

Authors: Cailian Deng, Xuming Fang, Xianbin Wang

Abstract: Unmanned aerial vehicles (UAVs) have high flexibility and controllable mobility, and are therefore considered a promising enabler for future integrated sensing and communication (ISAC). In this paper, we propose a novel adaptable ISAC (AISAC) mechanism in the UAV-enabled system, where the UAV performs sensing on demand during communication and the sensing duration is configured flexibly according to the application requirements rather than being kept the same as the communication duration. Our designed mechanism avoids excessive sensing and waste of radio resources, thereby improving resource utilization and system performance. In the UAV-enabled AISAC system, we aim at maximizing the average system throughput by optimizing the communication and sensing beamforming as well as the UAV trajectory while guaranteeing the quality-of-service requirements of communication and sensing. To efficiently solve the considered non-convex optimization problem, we first propose an efficient alternating optimization algorithm to optimize the communication and sensing beamforming for a given UAV location, and then develop a low-complexity joint beamforming and UAV trajectory optimization algorithm that sequentially searches the optimal UAV location until reaching the final location. Numerical results validate the superiority of the proposed adaptable mechanism and the effectiveness of the designed algorithm.

Submitted 4 October, 2022; originally announced October 2022.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2209.13915 [pdf, ps, other]

Joint Optimization of Resource Allocation and Trajectory Control for Mobile Group Users in Fixed-Wing UAV-Enabled Wireless Network

Authors: Xuezhen Yan, Xuming Fang, Cailian Deng, Xianbin Wang

Abstract: Owing to their control flexibility and cost-effectiveness, fixed-wing unmanned aerial vehicles (UAVs) are expected to serve as flying base stations (BSs) in the air-ground integrated network. By exploiting the mobility of UAVs, controllable coverage can be provided for mobile group users (MGUs) under challenging scenarios or even in places without communication infrastructure. However, in such a dual-mobility scenario where the UAV and MGUs are all moving, both the non-hovering feature of the fixed-wing UAV and the movement of MGUs will exacerbate the dynamic changes of user scheduling, which eventually leads to the degradation of MGUs' quality-of-service (QoS). In this paper, we propose a fixed-wing UAV-enabled wireless network architecture to provide moving coverage for MGUs. In order to achieve fairness among MGUs, we maximize the minimum average throughput among all users by jointly optimizing the user scheduling, resource allocation, and UAV trajectory control under the constraints on users' QoS requirements, communication resources, and UAV trajectory switching. Since the optimization problem is mixed-integer non-convex, we decompose it into three optimization subproblems. An efficient algorithm is proposed to solve these three subproblems alternately until convergence. Simulation results demonstrate that the proposed algorithm can significantly improve the minimum average throughput of MGUs.

Submitted 28 September, 2022; originally announced September 2022.

Comments: 30 pages, 9 figures

arXiv:2208.13952 [pdf, other]

doi 10.1109/TGRS.2022.3223649

Micro-Vibration Modes Reconstruction Based on Micro-Doppler Coincidence Imaging

Authors: Shuang Liu, Chen** Deng, Chaoran Wang, Zunwang Bo, Shensheng Han, Zihuai Lin

Abstract: Micro-vibration, a ubiquitous natural phenomenon, can be seen as a characteristic feature of objects. These vibrations always have tiny amplitudes, much smaller than the wavelengths of the sensing systems, so the motion information can only be reflected in the phase term of the echo. Normally, a conventional radar system can detect these micro-vibrations through time-frequency analysis, but the vibration characteristics can then only be reflected by the time-frequency spectrum, and the spatial distribution of the micro-vibrations cannot be reconstructed precisely. Ghost imaging (GI), a novel imaging method also known as Coincidence Imaging that originated in the quantum and optical fields, can reconstruct unknown images using computational methods. To reconstruct the spatial distribution of micro-vibrations, this paper proposes a new method based on a coincidence imaging system. A detailed model of target micro-vibration is created first, taking into account two categories: discrete and continuous targets. In this work, we use the first-order field correlation feature to obtain the distribution of the objects' different micro-vibrations, based on the complex target models and time-frequency analysis.

Submitted 29 August, 2022; originally announced August 2022.

arXiv:2206.12774 [pdf, other]

Meta Auxiliary Learning for Low-resource Spoken Language Understanding

Authors: Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang

Abstract: Spoken language understanding (SLU) treats automatic speech recognition (ASR) and natural language understanding (NLU) as a unified task and usually suffers from data scarcity. We exploit an ASR and NLU joint training method based on meta auxiliary learning to improve the performance of the low-resource SLU task by taking advantage only of abundant manual transcriptions of speech data. One obvious advantage of such a method is that it provides a flexible framework to implement a low-resource SLU training task without requiring access to any further semantic annotations. In particular, an NLU model is taken as a label generation network to predict intent and slot tags from texts; a multi-task network trains the ASR task and the SLU task synchronously from speech; and the predictions of the label generation network are delivered to the multi-task network as semantic targets. The efficiency of the proposed algorithm is demonstrated with experiments on the public CATSLU dataset, which produces more suitable ASR hypotheses for the downstream NLU task.

Submitted 25 June, 2022; originally announced June 2022.

arXiv:2206.08031 [pdf, other]

A CTC Triggered Siamese Network with Spatial-Temporal Dropout for Speech Recognition

Authors: Yingying Gao, Junlan Feng, Tianrui Wang, Chao Deng, Shilei Zhang

Abstract: Siamese networks have shown effective results in unsupervised visual representation learning. These models are designed to learn an invariant representation of two augmentations for one input by maximizing their similarity. In this paper, we propose an effective Siamese network to improve the robustness of End-to-End automatic speech recognition (ASR). We introduce spatial-temporal dropout to support a more violent disturbance for the Siamese-ASR framework. In addition, we relax the similarity regularization to maximize the similarities of distributions only on the frames where connectionist temporal classification (CTC) spikes occur, rather than on all frames. The efficiency of the proposed architecture is evaluated on two benchmarks, AISHELL-1 and Librispeech, resulting in 7.13% and 6.59% relative character error rate (CER) and word error rate (WER) reductions respectively. Analysis shows that our proposed approach brings better uniformity to the trained model and noticeably enlarges the CTC spikes.

Submitted 22 June, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

arXiv:2202.04250 [pdf, other]

GenAD: General Representations of Multivariate Time Series for Anomaly Detection

Authors: Xiaolei Hua, Lin Zhu, Shenglin Zhang, Zeyan Li, Su Wang, Dong Zhou, Shuo Wang, Chao Deng

Abstract: The reliability of wireless base stations in China Mobile is of vital importance, because cell phone users are connected to the stations and the behaviors of the stations are directly related to user experience. Although the monitoring of station behaviors can be realized by anomaly detection on multivariate time series, due to complex correlations and various temporal patterns of multivariate series in large-scale stations, building a general unsupervised anomaly detection model with a higher F1-score remains a challenging task. In this paper, we propose a General representation of multivariate time series for Anomaly Detection (GenAD). First, we pre-train a general model on large-scale wireless base stations with self-supervision, which can be easily transferred to anomaly detection for a specific station with a small amount of training data. Second, we employ Multi-Correlation Attention and Time-Series Attention to represent the correlations and temporal patterns of the stations. With the above innovations, GenAD increases F1-score by a total of 9% on real-world datasets in China Mobile, while the performance does not significantly degrade on public datasets with only 10% of the training data.

Submitted 8 February, 2022; originally announced February 2022.

arXiv:2111.14220 [pdf, other]

On the Robustness and Generalization of Deep Learning Driven Full Waveform Inversion

Authors: Chengyuan Deng, Youzuo Lin

Abstract: The data-driven approach has been demonstrated as a promising technique to solve complicated scientific problems. Full Waveform Inversion (FWI) is commonly epitomized as an image-to-image translation task, which motivates the use of deep neural networks as an end-to-end solution. Despite being trained with synthetic data, the deep learning-driven FWI is expected to perform well when evaluated with sufficient real-world data. In this paper, we study such properties by asking: how robust are these deep neural networks and how do they generalize? For robustness, we prove the upper bounds of the deviation between the predictions from clean and noisy data. Moreover, we demonstrate an interplay between the noise level and the additional gain of loss. For generalization, we prove a norm-based generalization error upper bound via a stability-generalization framework. Experimental results on seismic FWI datasets corroborate the theoretical results, shedding light on a better understanding of utilizing Deep Learning for complicated scientific applications.

Submitted 28 November, 2021; originally announced November 2021.

arXiv:2111.02926 [pdf, other]

OpenFWI: Large-Scale Multi-Structural Benchmark Datasets for Seismic Full Waveform Inversion

Authors: Chengyuan Deng, Shihang Feng, Hanchen Wang, Xitong Zhang, Peng **, Yinan Feng, Qili Zeng, Yinpeng Chen, Youzuo Lin

Abstract: Full waveform inversion (FWI) is widely used in geophysics to reconstruct high-resolution velocity maps from seismic data. The recent success of data-driven FWI methods results in a rapidly increasing demand for open datasets to serve the geophysics community. We present OpenFWI, a collection of large-scale multi-structural benchmark datasets, to facilitate diversified, rigorous, and reproducible research on FWI. In particular, OpenFWI consists of 12 datasets (2.1TB in total) synthesized from multiple sources. It encompasses diverse domains in geophysics (interface, fault, CO2 reservoir, etc.), covers different geological subsurface structures (flat, curve, etc.), and contains various amounts of data samples (2K - 67K). It also includes a dataset for 3D FWI. Moreover, we use OpenFWI to perform benchmarking over four deep learning methods, covering both supervised and unsupervised learning regimes. Along with the benchmarks, we implement additional experiments, including physics-driven methods, complexity analysis, generalization study, uncertainty quantification, and so on, to sharpen our understanding of datasets and methods. The studies either provide valuable insights into the datasets and the performance, or uncover their current limitations. We hope OpenFWI supports prospective research on FWI and inspires future open-source efforts on AI for science. All datasets and related information can be accessed through our website at https://openfwi-lanl.github.io/

Submitted 23 June, 2023; v1 submitted 4 November, 2021; originally announced November 2021.

Comments: This manuscript has been accepted by NeurIPS 2022 dataset and benchmark track

arXiv:2106.15765 [pdf, other]

doi 10.1364/PRJ.435256

10-mega pixel snapshot compressive imaging with a hybrid coded aperture

Authors: Zhihong Zhang, Chao Deng, Yang Liu, Xin Yuan, **li Suo, Qionghai Dai

Abstract: High resolution images are widely used in our daily life, whereas high-speed video capture is challenging due to the low frame rate of cameras working at the high resolution mode. Digging deeper, the main bottleneck lies in the low throughput of existing imaging systems. Towards this end, snapshot compressive imaging (SCI) was proposed as a promising solution to improve the throughput of imaging systems by compressive sampling and computational reconstruction. During acquisition, multiple high-speed images are encoded and collapsed to a single measurement. After this, algorithms are employed to retrieve the video frames from the coded snapshot. Recently developed Plug-and-Play (PnP) algorithms make it possible for SCI reconstruction in large-scale problems. However, the lack of high-resolution encoding systems still precludes SCI's wide application. In this paper, we build a novel hybrid coded aperture snapshot compressive imaging (HCA-SCI) system by incorporating a dynamic liquid crystal on silicon and a high-resolution lithography mask. We further implement a PnP reconstruction algorithm with cascaded denoisers for high quality reconstruction. Based on the proposed HCA-SCI system and algorithm, we achieve a 10-mega pixel SCI system to capture high-speed scenes, leading to a high throughput of 4.6G voxels per second. Both simulation and real data experiments verify the feasibility and performance of our proposed HCA-SCI scheme.

Submitted 15 August, 2021; v1 submitted 29 June, 2021; originally announced June 2021.

Comments: 11 pages, 8 figures, accepted by Photonics Research

arXiv:2011.02109 [pdf]

Deep Multi-task Network for Delay Estimation and Echo Cancellation

Authors: Yi Zhang, Chengyun Deng, Shiqian Ma, Yongtao Sha, Hui Song

Abstract: Echo path delay (or ref-delay) estimation is a big challenge in acoustic echo cancellation. Different devices may introduce various ref-delays in practice. Ref-delay inconsistency slows down the convergence of adaptive filters, and also degrades the performance of deep learning models due to 'unseen' ref-delays in the training set. In this paper, a multi-task network is proposed to address both ref-delay estimation and echo cancellation tasks. The proposed architecture consists of two convolutional recurrent networks (CRNNs) to estimate the echo and enhanced signals separately, as well as a fully-connected (FC) network to estimate the echo path delay. The echo signal is first predicted and then combined with the reference signal for delay estimation. Finally, the delay-compensated reference and microphone signals are used to predict the enhanced target signal. Experimental results suggest that the proposed method makes reliable delay estimation and outperforms the existing state-of-the-art solutions in inconsistent echo path delay scenarios, in terms of echo return loss enhancement (ERLE) and perceptual evaluation of speech quality (PESQ). Furthermore, a data augmentation method is studied to evaluate the model performance on different portions of synthetic data with artificially introduced ref-delay.

Submitted 11 August, 2022; v1 submitted 3 November, 2020; originally announced November 2020.

Comments: Accepted by Interspeech 2020

arXiv:2011.02102 [pdf, other]

Robust Speaker Extraction Network Based on Iterative Refined Adaptation

Authors: Chengyun Deng, Shiqian Ma, Yi Zhang, Yongtao Sha, Hui Zhang, Hui Song, Xiangang Li

Abstract: Speaker extraction aims to extract the target speech signal from a multi-talker environment with interference speakers and surrounding noise, given the target speaker's reference information. Most speaker extraction systems achieve satisfactory performance on the premise that the test speakers have been encountered during training time. Such systems suffer from performance degradation given unseen target speakers and/or mismatched reference voiceprint information. In this paper we propose a novel strategy named Iterative Refined Adaptation (IRA) to improve the robustness and generalization capability of speaker extraction systems in the aforementioned scenarios. Given an initial speaker embedding encoded by an auxiliary network, the extraction network can obtain a latent representation of the target speaker, which is fed back to the auxiliary network to get a refined embedding to provide more accurate guidance for the extraction network. Experiments on the WSJ0-2mix-extr and WHAM! datasets confirm the superior performance of the proposed method over the network without IRA in terms of SI-SDR and PESQ improvement.

Submitted 11 August, 2022; v1 submitted 3 November, 2020; originally announced November 2020.

Comments: Accepted by Interspeech 2021

arXiv:2007.14974 [pdf, other]

On Loss Functions and Recurrency Training for GAN-based Speech Enhancement Systems

Authors: Zhuohuang Zhang, Chengyun Deng, Yi Shen, Donald S. Williamson, Yongtao Sha, Yi Zhang, Hui Song, Xiangang Li

Abstract: Recent work has shown that it is feasible to use generative adversarial networks (GANs) for speech enhancement; however, these approaches have not been compared to state-of-the-art (SOTA) non-GAN-based approaches. Additionally, many loss functions have been proposed for GAN-based approaches, but they have not been adequately compared. In this study, we propose novel convolutional recurrent GAN (CRGAN) architectures for speech enhancement. Multiple loss functions are adopted to enable direct comparisons to other GAN-based systems. The benefits of including recurrent layers are also explored. Our results show that the proposed CRGAN model outperforms the SOTA GAN-based models using the same loss functions and it outperforms other non-GAN-based systems, indicating the benefits of using a GAN for speech enhancement. Overall, the CRGAN model that combines an objective metric loss function with the mean squared error (MSE) provides the best performance over comparison approaches across many evaluation metrics.

Submitted 26 December, 2020; v1 submitted 29 July, 2020; originally announced July 2020.

Comments: Accepted by Interspeech 2020, 5 pages, 2 figures

arXiv:2007.13401 [pdf, ps, other]

IEEE 802.11be-Wi-Fi 7: New Challenges and Opportunities

Authors: Cailian Deng, Xuming Fang, Xiao Han, Xianbin Wang, Li Yan, Rong He, Yan Long, Yuchen Guo

Abstract: With the emergence of 4k/8k video, the throughput requirement of video delivery will keep growing to tens of Gbps. Other new high-throughput and low-latency video applications, including augmented reality (AR), virtual reality (VR), and online gaming, are also proliferating. Due to the related stringent requirements, supporting these applications over wireless local area network (WLAN) is far beyond the capabilities of the new WLAN standard -- IEEE 802.11ax. To meet these emerging demands, the IEEE 802.11 will release a new amendment standard IEEE 802.11be -- Extremely High Throughput (EHT), also known as Wireless-Fidelity (Wi-Fi) 7. This article provides a comprehensive survey of the key medium access control (MAC) layer techniques and physical layer (PHY) techniques being discussed in the EHT task group, including the channelization and tone plan, multiple resource units (multi-RU) support, 4096 quadrature amplitude modulation (4096-QAM), preamble designs, multiple link operations (e.g., multi-link aggregation and channel access), multiple input multiple output (MIMO) enhancement, multiple access point (multi-AP) coordination (e.g., multi-AP joint transmission), and enhanced link adaptation and retransmission protocols (e.g., hybrid automatic repeat request (HARQ)). This survey covers both the critical technologies being discussed in the EHT standard and the latest related progress from worldwide research. In addition, the potential developments beyond EHT are discussed to provide some possible future research directions for WLAN.

Submitted 3 August, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

Comments: Accepted for publication in IEEE Communications Surveys and Tutorials

arXiv:1912.01852 [pdf, other]

PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network

Authors: Chengqi Deng, Chengzhu Yu, Heng Lu, Chao Weng, Dong Yu

Abstract: Singing voice conversion is to convert a singer's voice to another one's voice without changing singing content. Recent work shows that unsupervised singing voice conversion can be achieved with an autoencoder-based approach [1]. However, the converted singing voice can be easily out of key, showing that the existing approach cannot model the pitch information precisely. In this paper, we propose to advance the existing unsupervised singing voice conversion method proposed in [1] to achieve more accurate pitch translation and flexible pitch manipulation. Specifically, the proposed PitchNet added an adversarially trained pitch regression network to enforce the encoder network to learn pitch invariant phoneme representation, and a separate module to feed pitch extracted from the source audio to the decoder network. Our evaluation shows that the proposed method can greatly improve the quality of the converted singing voice (2.92 vs 3.75 in MOS). We also demonstrate that the pitch of converted singing can be easily controlled during generation by changing the levels of the extracted pitch before passing it to the decoder network.

Submitted 18 February, 2020; v1 submitted 4 December, 2019; originally announced December 2019.

Comments: Accepted by ICASSP 2020

arXiv:1901.07042 [pdf, other]

MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs

Authors: Alistair E. W. Johnson, Tom J. Pollard, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Yifan Peng, Zhiyong Lu, Roger G. Mark, Seth J. Berkowitz, Steven Horng

Abstract: Chest radiography is an extremely powerful imaging modality, allowing for a detailed inspection of a patient's thorax, but requiring specialized training for proper interpretation. With the advent of high performance general purpose computer vision algorithms, the accurate automated analysis of chest radiographs is becoming increasingly of interest to researchers. However, a key challenge in the development of these techniques is the lack of sufficient data. Here we describe MIMIC-CXR-JPG v2.0.0, a large dataset of 377,110 chest x-rays associated with 227,827 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 - 2016. Images are provided with 14 labels derived from two natural language processing tools applied to the corresponding free-text radiology reports. MIMIC-CXR-JPG is derived entirely from the MIMIC-CXR database, and aims to provide a convenient processed version of MIMIC-CXR, as well as to provide a standard reference for data splits and image labels. All images have been de-identified to protect patient privacy. The dataset is made freely available to facilitate and encourage a wide range of research in medical computer vision.

Submitted 14 November, 2019; v1 submitted 21 January, 2019; originally announced January 2019.

arXiv:1811.03455 [pdf, other]

High fidelity single-pixel imaging

Authors: Chao Deng, Xuemei Hu, Xiaoxu Li, **li Suo, Zhili Zhang, Qionghai Dai

Abstract: Single-pixel imaging (SPI) is an emerging technique which has attracted wide attention in various research fields. However, restricted by the low reconstruction quality and the large number of measurements required, its practical application is still in its infancy. Inspired by the fact that natural scenes exhibit unique degenerate structures in the low dimensional subspace, we propose to take advantage of the local prior in convolutional sparse coding to implement high fidelity single-pixel imaging. Specifically, using a statistical learning strategy, the target scene can be sparsely represented on an overcomplete dictionary. The dictionary is composed of various bases learned from a natural image database. We introduce the above local prior into the conventional SPI framework to improve the final reconstruction quality. Experiments on both synthetic data and real captured data demonstrate that our method can achieve better reconstruction from the same measurements, and consequently reduce the number of measurements required for the same reconstruction quality.

Submitted 7 November, 2018; originally announced November 2018.

Comments: 5 pages, 6 figures

arXiv:1708.06933 [pdf]

doi 10.1007/s12043-018-1590-5

On Non-Consensus Motions of Dynamical Linear Multi-Agent Systems

Authors: Ning Cai, Chun-Lin Deng, Qiu-Xuan Wu

Abstract: The non-consensus problems of high order linear time-invariant dynamical homogeneous multi-agent systems are considered. Based on the conditions of consensus achievement, the mechanisms that lead to non-consensus motions are analyzed. In addition, a comprehensive classification of the diverse types of non-consensus phases according to the different conditions is conducted, which jointly depends on the self-dynamics of agents, the interactive protocol and the graph topology. A series of numerical examples are presented to illustrate the theoretical analysis.

Submitted 23 August, 2017; originally announced August 2017.

Showing 1–31 of 31 results for author: Deng, C