Active-RIS-Aided Covert Communications in NOMA-Inspired ISAC Wireless Systems

Authors: Miaomiao Zhu, Pengxu Chen, Liang Yang, Alexandros-Apostolos A. Boulogeorgos, Theodoros A. Tsiftsis, Hongwu Liu

Abstract: Non-orthogonal multiple access (NOMA)-inspired integrated sensing and communication (ISAC) facilitates spectrum sharing for radar sensing and NOMA communications, but faces privacy and security challenges due to open wireless propagation. In this paper, an active reconfigurable intelligent surface (RIS) is employed to aid covert communications in a NOMA-inspired ISAC wireless system with the aim of maximizing the covert rate. Specifically, a dual-function base-station (BS) transmits the superposition signal to sense multiple targets, while achieving covert and reliable communications for a pair of NOMA covert and public users, respectively, in the presence of a warden. Two superposition transmission schemes, namely, the transmissions with dedicated sensing signal (w-DSS) and without dedicated sensing signal (w/o-DSS), are respectively considered in the formulations of the joint transmission and reflection beamforming optimization problems. Numerical results demonstrate that the active-RIS-aided NOMA-ISAC system outperforms the passive-RIS-aided and without-RIS counterparts in terms of covert rate and trade-off between covert communication and sensing performance metrics. Finally, the w/o-DSS scheme, which omits the dedicated sensing signal, achieves a higher covert rate than the w-DSS scheme by allocating more transmit power for the covert transmissions, while preserving a comparable multi-target sensing performance.

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.18862 [pdf, other]

Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study

Authors: Peikun Chen, Sining Sun, Changhao Shan, Qing Yang, Lei Xie

Abstract: Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially in Automatic Speech Recognition (ASR). These models typically adopt a unified method to model discrete speech and text tokens, followed by training a decoder-only transformer. However, they are all designed for non-streaming ASR tasks, where the entire speech utterance is needed during decoding. Hence, we introduce a decoder-only model exclusively designed for streaming recognition, incorporating a dedicated boundary token to facilitate streaming recognition and employing causal attention masking during the training phase. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model's contextual modeling abilities. While achieving streaming speech recognition, experiments on the AISHELL-1 and -2 datasets demonstrate the competitive performance of our streaming approach with non-streaming decoder-only counterparts.

Submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted for Interspeech 2024

arXiv:2406.06375 [pdf, other]

doi 10.1109/TASLP.2024.3407529

MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing

Authors: Yu-Fen Huang, Nikki Moran, Simon Coleman, Jon Kelly, Shun-Hwa Wei, Po-Yin Chen, Yun-Hsin Huang, Tsung-Ping Chen, Yu-Chia Kuo, Yu-Chi Wei, Chih-Hsuan Li, Da-Yu Huang, Hsuan-Kai Kao, Ting-Wei Lin, Li Su

Abstract: In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamic, articulation, and harmony for 742 professional music performances by 23 professional musicians, comprising more than 30 hours and 570 K notes of data. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date. To demonstrate the usage of the MOSA dataset, we present several innovative cross-modal music information retrieval (MIR) and musical content generation tasks, including the detection of beats, downbeats, phrase, and expressive contents from audio, video and motion data, and the generation of musicians' body motion from given music audio. The dataset and codes are available alongside this publication (https://github.com/yufenhuang/MOSA-Music-mOtion-and-Semantic-Annotation-dataset).

Submitted 10 June, 2024; originally announced June 2024.

Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024. 14 pages, 7 figures. Dataset is available on: https://github.com/yufenhuang/MOSA-Music-mOtion-and-Semantic-Annotation-dataset/tree/main and https://zenodo.org/records/11393449

arXiv:2406.02733 [pdf, other]

Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation

Authors: Min-Jae Hwang, Ilia Kulikov, Benjamin Peloquin, Hongyu Gong, Peng-Jen Chen, Ann Lee

Abstract: In this paper, we propose a textless acoustic model with a self-supervised distillation strategy for noise-robust expressive speech-to-speech translation (S2ST). Recently proposed expressive S2ST systems have achieved impressive expressivity preservation performances by cascading a unit-to-speech (U2S) generator to the speech-to-unit translation model. However, these systems are vulnerable to the presence of noise in input speech, which is an assumption in real-world translation scenarios. To address this limitation, we propose a U2S generator that incorporates a distillation with no label (DINO) self-supervised training strategy into its pretraining process. Because the proposed method captures noise-agnostic expressivity representation, it can generate qualified speech even in noisy environments. Objective and subjective evaluation results verified that the proposed method significantly improved the performance of the expressive S2ST system in noisy environments while maintaining competitive performance in clean environments.

Submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted to ACL 2024 (findings)

arXiv:2405.18669 [pdf, other]

Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities

Authors: Vicky Zayats, Peter Chen, Melissa Ferrari, Dirk Padfield

Abstract: Integrating multiple generative foundation models, especially those trained on different modalities, into something greater than the sum of its parts poses significant challenges. Two key hurdles are the availability of aligned data (concepts that contain similar meaning but are expressed differently in different modalities), and effectively leveraging unimodal representations in cross-domain generative tasks, without compromising their original unimodal capabilities. We propose Zipper, a multi-tower decoder architecture that addresses these concerns by using cross-attention to flexibly compose multimodal generative models from independently pre-trained unimodal decoders. In our experiments fusing speech and text modalities, we show the proposed architecture performs very competitively in scenarios with limited aligned text-speech data. We also showcase the flexibility of our model to selectively maintain unimodal (e.g., text-to-text generation) generation performance by freezing the corresponding modal tower (e.g., text). In cross-modal tasks such as automatic speech recognition (ASR) where the output modality is text, we show that freezing the text backbone results in negligible performance degradation. In cross-modal tasks such as text-to-speech generation (TTS) where the output modality is speech, we show that using a pre-trained speech backbone results in superior performance to the baseline.

Submitted 31 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: Under review at NeurIPS

arXiv:2405.16677 [pdf, other]

Crossmodal ASR Error Correction with Discrete Speech Units

Authors: Yuanchao Li, Pinzhen Chen, Peter Bell, Catherine Lai

Abstract: ASR remains unsatisfactory in scenarios where the speaking style diverges from that used to train ASR systems, resulting in erroneous transcripts. To address this, ASR Error Correction (AEC), a post-ASR processing approach, is required. In this work, we tackle an understudied issue: the Low-Resource Out-of-Domain (LROOD) problem, by investigating crossmodal AEC on very limited downstream data with 1-best hypothesis transcription. We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon, shedding light on appropriate training schemes for LROOD data. Moreover, we propose the incorporation of discrete speech units to align with and enhance the word embeddings for improving AEC quality. Results from multiple corpora and several evaluation metrics demonstrate the feasibility and efficacy of our proposed AEC approach on LROOD data, as well as its generalizability and superiority on large-scale data. Finally, a study on speech emotion recognition confirms that our model produces ASR error-robust transcripts suitable for downstream applications.

Submitted 26 May, 2024; originally announced May 2024.

arXiv:2405.14161 [pdf, other]

Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Chengwei Qin, Pin-Yu Chen, Eng Siong Chng, Chao Zhang

Abstract: We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architecture with auto-regressive decoding (e.g., Whisper, Canary). Specifically, we propose a novel indicator that empirically integrates step-wise information during decoding to assess the token-level quality of pseudo labels without ground truth, thereby guiding model updates for effective unsupervised adaptation. Experimental results show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains, and it sometimes even approaches the upper-bound performance of supervised adaptation. Surprisingly, we also observe that STAR protects the adapted model from the common catastrophic forgetting problem without recalling source-domain data. Furthermore, STAR exhibits high data efficiency, requiring less than one hour of unlabeled data, and generalizes seamlessly to alternative large speech models and speech translation tasks. We aim to open-source our code to the research community.

Submitted 23 May, 2024; originally announced May 2024.

Comments: 23 pages, Preprint

arXiv:2405.07281 [pdf, ps, other]

Movable Antennas Aided Multicast MISO Communication Systems

Authors: Zhenqiao Cheng, Nanxi Li, Ruizhe Long, Jianchi Zhu, Chongjun Ouyang, Peng Chen

Abstract: A novel multicast communication system with movable antennas (MAs) is proposed, where the antenna position optimization is exploited to enhance the transmission rate. Specifically, an MA-assisted two-user multicast multiple-input single-output system is considered. The joint optimization of the transmit beamforming vector and transmit MA positions is studied by modeling the motion of the MA elements as discrete movements. A low-complexity greedy search-based algorithm is proposed to tackle this non-convex integer programming problem. A branch-and-bound (BAB)-based method is proposed to achieve the optimal multicast rate with lower time complexity than the brute-force search by assuming the two users suffer similar line-of-sight path losses. Numerical results reveal that the proposed MA systems significantly improve the multicast rate compared to conventional fixed-position antennas (FPAs)-based systems.

Submitted 12 May, 2024; originally announced May 2024.

Comments: 5 pages

arXiv:2404.18406 [pdf, ps, other]

Movable Antenna-Enhanced Wireless Powered Mobile Edge Computing Systems

Authors: Pengcheng Chen, Yuxuan Yang, Bin Lyu, Zhen Yang, Abbas Jamalipour

Abstract: In this paper, we propose a movable antenna (MA) enhanced scheme for a wireless powered mobile edge computing (WP-MEC) system, where the hybrid access point (HAP) equipped with multiple MAs first emits wireless energy to charge wireless devices (WDs), and then receives the offloaded tasks from the WDs for edge computing. The MAs deployed at the HAP enhance the spatial degrees of freedom (DoFs) by flexibly adjusting the positions of MAs within an available region, thereby improving the efficiency of both downlink wireless power transfer (WPT) and uplink task offloading. To balance the performance enhancement against the implementation intricacy, we further propose three types of MA positioning configurations, i.e., dynamic MA positioning, semi-dynamic MA positioning, and static MA positioning. In addition, the non-linear power conversion of energy harvesting (EH) circuits at the WDs and the finite computing capability at the edge server are taken into account. Our objective is to maximize the sum computational rate (SCR) by jointly optimizing the time allocation, positions of MAs, energy beamforming matrix, receive combining vectors, and offloading strategies of WDs. To solve the non-convex problems, efficient alternating optimization (AO) frameworks are proposed. Moreover, we propose a hybrid algorithm of particle swarm optimization with variable local search (PSO-VLS) to solve the sub-problem of MA positioning. Numerical results validate the superiority of exploiting MAs over the fixed-position antennas (FPAs) for enhancing the SCR performance of WP-MEC systems.

Submitted 28 April, 2024; originally announced April 2024.

Comments: 13 pages, 10 figures. Submitted for possible publication

arXiv:2404.17400 [pdf, other]

Spatial-frequency Dual-Domain Feature Fusion Network for Low-Light Remote Sensing Image Enhancement

Authors: Zishu Yao, Guodong Fan, Jinfu Fan, Min Gan, C. L. Philip Chen

Abstract: Low-light remote sensing images generally feature high resolution and high spatial complexity, with continuously distributed surface features in space. This continuity in scenes leads to extensive long-range correlations in spatial domains within remote sensing images. Convolutional Neural Networks, which rely on local correlations for long-distance modeling, struggle to establish long-range correlations in such images. On the other hand, transformer-based methods that focus on global information face high computational complexities when processing high-resolution remote sensing images. From another perspective, the Fourier transform can compute global information without introducing a large number of parameters, enabling the network to more efficiently capture the overall image structure and establish long-range correlations. Therefore, we propose a Dual-Domain Feature Fusion Network (DFFN) for low-light remote sensing image enhancement. Specifically, this challenging task of low-light enhancement is divided into two more manageable sub-tasks: the first phase learns amplitude information to restore image brightness, and the second phase learns phase information to refine details. To facilitate information exchange between the two phases, we designed an information fusion affine block that combines data from different phases and scales. Additionally, we have constructed two dark light remote sensing datasets to address the current lack of datasets in dark light remote sensing image enhancement. Extensive evaluations show that our method outperforms existing state-of-the-art methods. The code is available at https://github.com/iijjlk/DFFN.

Submitted 26 April, 2024; originally announced April 2024.

Comments: 14 pages

arXiv:2404.16302 [pdf, other]

CFMW: Cross-modality Fusion Mamba for Multispectral Object Detection under Adverse Weather Conditions

Authors: Haoyuan Li, Qi Hu, You Yao, Kailun Yang, Peng Chen

Abstract: Cross-modality images that integrate visible-infrared spectra cues can provide richer complementary information for object detection. Despite this, existing visible-infrared object detection methods degrade severely in severe weather conditions. This failure stems from the pronounced sensitivity of visible images to environmental perturbations, such as rain, haze, and snow, which frequently cause false negatives and false positives in detection. To address this issue, we introduce a novel and challenging task, termed visible-infrared object detection under adverse weather conditions. To foster this task, we have constructed a new Severe Weather Visible-Infrared Dataset (SWVID) with diverse severe weather scenes. Furthermore, we introduce the Cross-modality Fusion Mamba with Weather-removal (CFMW) to augment detection accuracy in adverse weather conditions. Thanks to the proposed Weather Removal Diffusion Model (WRDM) and Cross-modality Fusion Mamba (CFM) modules, CFMW is able to mine more essential information of pedestrian features in cross-modality fusion, and thus can transfer to other rarer scenarios with high efficiency and has adequate availability on platforms with low computing power. To the best of our knowledge, this is the first study that targets improvement and integrates both Diffusion and Mamba modules in cross-modality object detection, successfully expanding the practical application of this type of model with its higher accuracy and more advanced architecture. Extensive experiments on both well-recognized and self-created datasets conclusively demonstrate that our CFMW achieves state-of-the-art detection performance, surpassing existing benchmarks. The dataset and source code will be made publicly available at https://github.com/lhy-zjut/CFMW.

Submitted 24 April, 2024; originally announced April 2024.

Comments: The dataset and source code will be made publicly available at https://github.com/lhy-zjut/CFMW

arXiv:2403.19044 [pdf, other]

doi 10.1109/TIV.2024.3383063

Low-Complexity Estimation Algorithm and Decoupling Scheme for FRaC System

Authors: Mengjiang Sun, Peng Chen, Zhenxin Cao, Fei Shen

Abstract: With the leaping advances in autonomous vehicles and transportation infrastructure, dual function radar-communication (DFRC) systems have become attractive due to their size, cost and resource efficiency. A frequency modulated continuous waveform (FMCW)-based radar-communication system (FRaC) utilizing both sparse multiple-input and multiple-output (MIMO) arrays and index modulation (IM) has been proposed to form a DFRC system specifically designed for vehicular applications. In this paper, the three-dimensional (3D) parameter estimation problem in the FRaC is considered. Since the 3D-parameters including range, direction of arrival (DOA) and velocity are coupled in the estimating matrix of the FRaC system, the existing estimation algorithms cannot estimate the 3D-parameters accurately. Hence, a novel decomposed decoupled atomic norm minimization (DANM) method is proposed by splitting the 3D-parameter estimating matrix into multiple 2D matrices with sparsity constraints. Then, the 3D-parameters are estimated efficiently and separately with the optimized decoupled estimating matrix. Moreover, the Cramér-Rao lower bound (CRLB) of the 3D-parameter estimation is derived, and the computational complexity of the proposed algorithm is analyzed. Simulation results show that the proposed decomposed DANM method exploits the advantage of the virtual aperture in the presence of coupling caused by IM and sparse MIMO array and outperforms the co-estimation algorithm with lower computation complexity.

Submitted 27 March, 2024; originally announced March 2024.

Journal ref: IEEE Transactions on Intelligent Vehicles, 2024

arXiv:2403.14978 [pdf, other]

Range-Angle Estimation for FDA-MIMO System With Frequency Offset

Authors: Mengjiang Sun, Peng Chen, Zhenxin Cao

Abstract: Frequency diverse array multiple-input multiple-output (FDA-MIMO) radar differs from the traditional phased array (PA) radar, and can form a range-angle-dependent beampattern and differentiate between closely spaced targets sharing the same angle but occupying distinct range cells. In the FDA-MIMO radar, target range estimation is achieved by employing a subtle frequency variation between adjacent array antennas, so the estimation performance is degraded severely in a practical scenario with frequency offset. In this paper, the range-angle estimation problem for FDA-MIMO radar is considered with frequency offsets in both transmitting and receiving arrays. First, we build a system model for the FDA-MIMO radar with transmitting and receiving frequency offsets. Then, the frequency offset is transferred into an equalized additional noise. The noise characteristics are analyzed in detail theoretically, together with the influence on the range-angle estimation. Moreover, since the effect of the transmitting frequency offset is similar to additional colored noise, denoising algorithms are introduced to mitigate the performance deterioration caused by the frequency offset. Finally, Cramér-Rao lower bounds (CRLB) for the range-angle estimation are derived in the scenario with the frequency offsets. Simulation results show the analysis of frequency offset and the corresponding estimation performance using different algorithms.

Submitted 22 March, 2024; originally announced March 2024.

Journal ref: IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS, 2024

arXiv:2403.11737 [pdf, other]

SMT-Based Dynamic Multi-Robot Task Allocation

Authors: Victoria Marie Tuck, Pei-Wei Chen, Georgios Fainekos, Bardh Hoxha, Hideki Okamoto, S. Shankar Sastry, Sanjit A. Seshia

Abstract: Multi-Robot Task Allocation (MRTA) is a problem that arises in many application domains including package delivery, warehouse robotics, and healthcare. In this work, we consider the problem of MRTA for a dynamic stream of tasks with task deadlines and capacitated agents (capacity for more than one simultaneous task). Previous work commonly focuses on the static case, uses specialized algorithms for restrictive task specifications, or lacks guarantees. We propose an approach to Dynamic MRTA for capacitated robots that is based on Satisfiability Modulo Theories (SMT) solving and addresses these concerns. We show our approach is both sound and complete, and that the SMT encoding is general, enabling extension to a broader class of task specifications. We show how to leverage the incremental solving capabilities of SMT solvers, keeping learned information when allocating new tasks arriving online, and how to solve non-incrementally, and we provide runtime comparisons of the two. Additionally, we provide an algorithm to start with a smaller but potentially incomplete encoding that can iteratively be adjusted to the complete encoding. We evaluate our method on a parameterized set of benchmarks encoding multi-robot delivery created from a graph abstraction of a hospital-like environment. The effectiveness of our approach is demonstrated using a range of encodings, including quantifier-free theories of uninterpreted functions and linear or bitvector arithmetic across multiple solvers.

Submitted 18 March, 2024; originally announced March 2024.

Comments: 26 pages, 6 figures, to be published in NASA Formal Methods Symposium 2024

arXiv:2403.02854 [pdf, ps, other]

STAR-RIS Assisted Wireless-Powered and Backscattering Mobile Edge Computing Networks

Authors: Bin Lyu, Yining Zhang, Pengcheng Chen, Ziwei Liu, Feng Tian

Abstract: The wireless powered and backscattering mobile edge computing (WPB-MEC) network is a novel network paradigm to supply energy and computing resources to wireless sensors (WSs). However, its performance is seriously affected by severe attenuations and inappropriate assumptions of infinite computing capability at the hybrid access point (HAP). To address the above issues, in this paper, we propose a simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) aided scheme for boosting the performance of the WPB-MEC network under the constraint of finite computing capability. Specifically, energy-constrained WSs are able to offload tasks actively or passively to the HAP. In this process, the STAR-RIS is utilized to improve the quantity of harvested energy and strengthen the offloading efficiency by adapting its operating protocols. We then maximize the sum computational bits (SCBs) under the finite computing capability constraint. To handle the solving challenges, we first present interesting results in closed-form and then design a block coordinate descent (BCD) based algorithm, ensuring a near-optimal solution. Finally, simulation results are provided to confirm that our proposed scheme can improve the SCBs by 9.9 times compared to the local computing only scheme.

Submitted 5 March, 2024; originally announced March 2024.

Comments: Accepted by China Communications. 13 pages, 8 figures

arXiv:2402.05457 [pdf, other]

It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

Authors: Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Ensiong Chng, Chao-Han Huck Yang

Abstract: Recent studies have shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output. Specifically, an LLM is utilized to carry out a direct mapping from the N-best hypotheses list generated by an ASR system to the predicted output transcription. However, despite its effectiveness, GER introduces extra data uncertainty since the LLM is trained without taking into account acoustic information available in the speech signal. In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). UADF is a multimodal fusion approach implemented into an auto-regressive decoding process and works in two stages: (i) It first analyzes and calibrates the token-level LLM decision, and (ii) it then dynamically assimilates the information from the acoustic modality. Experimental evidence collected from various ASR tasks shows that UADF surpasses existing fusion mechanisms in several ways. It yields significant improvements in word error rate (WER) while mitigating data uncertainty issues in LLM and addressing the poor generalization that comes from relying on a sole modality during fusion. We also demonstrate that UADF seamlessly adapts to audio-visual speech recognition.

Submitted 8 February, 2024; originally announced February 2024.

Comments: Accepted to ICLR 2024, 17 pages. This work will be open sourced under MIT license

arXiv:2402.02140 [pdf, other]

Generative Visual Compression: A Review

Authors: Bolin Chen, Shanzhi Yin, Peilin Chen, Shiqi Wang, Yan Ye

Abstract: Artificial Intelligence Generated Content (AIGC) is leading a new technical revolution for the acquisition of digital content and impelling the progress of visual compression towards competitive performance gains and diverse functionalities over traditional codecs. This paper provides a thorough review on the recent advances of generative visual compression, illustrating great potentials and promising applications in ultra-low bitrate communication, user-specified reconstruction/filtering, and intelligent machine analysis. In particular, we review the visual data compression methodologies with deep generative models, and summarize how compact representation and high-fidelity reconstruction could be actualized via generative techniques. In addition, we generalize related generative compression technologies for machine vision and intelligent analytics. Finally, we discuss the fundamental challenges on generative visual compression techniques and envision their future research directions.

Submitted 3 February, 2024; originally announced February 2024.

arXiv:2401.10446 [pdf, other]

Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, EnSiong Chng

Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with the HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity on noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER just as robust ASR does, where one solution is introducing noise information as a conditioner into the LLM. However, directly incorporating noise embeddings from the audio encoder could harm the LLM tuning due to the cross-modality gap. To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of source speech, which can promote the denoising process in GER. Furthermore, in order to enhance its representation ability of audio noise, we design a knowledge distillation (KD) approach via mutual information estimation to distill the real noise information in audio embeddings to our language embedding. Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate, even with limited training data. Analysis shows that our language-space noise embedding can well represent the noise conditions of source speech, under which off-the-shelf LLMs show strong ability of language-space denoising.

Submitted 18 January, 2024; originally announced January 2024.

Comments: Accepted to ICLR 2024, Spotlight top 5%, 24 pages. This work will be open sourced at: https://github.com/YUCHEN005/RobustGER under MIT license

arXiv:2312.14018 [pdf, ps, other]

Enabling Secure Wireless Communications via Movable Antennas

Authors: Zhenqiao Cheng, Nanxi Li, Jianchi Zhu, Xiaoming She, Chongjun Ouyang, Peng Chen

Abstract: A pioneering secure transmission scheme is proposed, which harnesses movable antennas (MAs) to optimize antenna positions for augmenting the physical layer security. Particularly, an MA-enabled secure wireless system is considered, where a multi-antenna transmitter communicates with a single-antenna receiver in the presence of an eavesdropper. The beamformer and antenna positions at the transmitter are jointly optimized under two criteria: power consumption minimization and secrecy rate maximization. For each scenario, a novel suboptimal algorithm is proposed to tackle the resulting nonconvex optimization problem, capitalizing on the approaches of alternating optimization and gradient descent. Numerical results demonstrate that the proposed MA systems significantly improve physical layer security compared to various benchmark schemes relying on conventional fixed-position antennas (FPAs).

Submitted 21 December, 2023; originally announced December 2023.

Comments: Accepted by IEEE ICASSP 2024

arXiv:2312.05187 [pdf, other]

Seamless: Multilingual Expressive and Streaming Speech Translation

Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, et al. (40 additional authors not shown)

Abstract: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2312.01644 [pdf]

TMSR: Tiny Multi-path CNNs for Super Resolution

Authors: Chia-Hung Liu, Tzu-Hsin Hsieh, Kuan-Yu Huang, Pei-Yin Chen

Abstract: In this paper, we propose a tiny multi-path CNN-based Super-Resolution (SR) method, called TMSR. We mainly refer to tiny CNN-based SR methods with under 5k parameters. The main contributions of the proposed method are the improved multi-path learning and a self-defined activation function. The experimental results show that TMSR obtains competitive image quality (i.e., PSNR and SSIM) compared to the related works under 5k parameters.

Submitted 4 December, 2023; originally announced December 2023.

Comments: 5 pages, 7 figures, published in the IEEE Eurasia Conference on IoT, Communication and Engineering proceedings 2023

arXiv:2311.16565 [pdf, other]

DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser

Authors: Peng Chen, Xiaobao Wei, Ming Lu, Yitong Zhu, Naiming Yao, Xingyu Xiao, Hui Chen

Abstract: Speech-driven 3D facial animation has been an attractive task in both academia and industry. Traditional methods mostly focus on learning a deterministic mapping from speech to animation. Recent approaches start to consider the non-deterministic fact of speech-driven 3D face animation and employ the diffusion model for the task. However, personalizing facial animation and accelerating animation generation are still two major limitations of existing diffusion-based methods. To address the above limitations, we propose DiffusionTalker, a diffusion-based method that utilizes contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate 3D animation generation. Specifically, to enable personalization, we introduce a learnable talking identity to aggregate knowledge in audio sequences. The proposed identity embeddings extract customized facial cues across different people in a contrastive learning manner. During inference, users can obtain personalized facial animation based on input audio, reflecting a specific talking style. With a trained diffusion model with hundreds of steps, we distill it into a lightweight model with 8 steps for acceleration. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released.

Submitted 2 December, 2023; v1 submitted 28 November, 2023; originally announced November 2023.

arXiv:2310.13259 [pdf]

Domain-specific optimization and diverse evaluation of self-supervised models for histopathology

Authors: Jeremy Lai, Faruk Ahmed, Supriya Vijay, Tiam Jaroensri, Jessica Loo, Saurabh Vyawahare, Saloni Agarwal, Fayaz Jamil, Yossi Matias, Greg S. Corrado, Dale R. Webster, Jonathan Krause, Yun Liu, Po-Hsuan Cameron Chen, Ellery Wulczyn, David F. Steiner

Abstract: Task-specific deep learning models in histopathology offer promising opportunities for improving diagnosis, clinical research, and precision medicine. However, development of such models is often limited by availability of high-quality data. Foundation models in histopathology that learn general representations across a wide range of tissue types, diagnoses, and magnifications offer the potential to reduce the data, compute, and technical expertise necessary to develop task-specific deep learning models with the required level of model performance. In this work, we describe the development and evaluation of foundation models for histopathology via self-supervised learning (SSL). We first establish a diverse set of benchmark tasks involving 17 unique tissue types and 12 unique cancer types and spanning different optimal magnifications and task types. Next, we use this benchmark to explore and evaluate histopathology-specific SSL methods followed by further evaluation on held out patch-level and weakly supervised tasks. We found that standard SSL methods thoughtfully applied to histopathology images are performant across our benchmark tasks and that domain-specific methodological improvements can further increase performance. Our findings reinforce the value of using domain-specific SSL methods in pathology, and establish a set of high quality foundation models to enable further research across diverse applications.

Submitted 19 October, 2023; originally announced October 2023.

Comments: 4 main tables, 3 main figures, additional supplemental tables and figures

arXiv:2310.05051 [pdf, other]

SALT: Distinguishable Speaker Anonymization Through Latent Space Transformation

Authors: Yuanjun Lv, Jixun Yao, Peikun Chen, Hongbin Zhou, Heng Lu, Lei Xie

Abstract: Speaker anonymization aims to conceal a speaker's identity without degrading speech quality and intelligibility. Most speaker anonymization systems disentangle the speaker representation from the original speech and achieve anonymization by averaging or modifying the speaker representation. However, the anonymized speech is subject to reduction in pseudo speaker distinctiveness, speech quality and intelligibility for out-of-distribution speakers. To solve this issue, we propose SALT, a Speaker Anonymization system based on Latent space Transformation. Specifically, we extract latent features by a self-supervised feature extractor and randomly sample multiple speakers and their weights, and then interpolate the latent vectors to achieve speaker anonymization. Meanwhile, we explore the extrapolation method to further extend the diversity of pseudo speakers. Experiments on the Voice Privacy Challenge dataset show our system achieves a state-of-the-art distinctiveness metric while preserving speech quality and intelligibility. Our code and demo are available at https://github.com/BakerBunker/SALT.

Submitted 8 October, 2023; originally announced October 2023.

Comments: 8 pages, 3 figures; Accepted by ASRU2023

arXiv:2310.04645 [pdf, other]

Do self-supervised speech and language models extract similar representations as human brain?

Authors: Peili Chen, Linyang He, Li Fu, Lu Fan, Edward F. Chang, Yuanning Li

Abstract: Speech and language models trained through self-supervised learning (SSL) demonstrate strong alignment with brain activity during speech and language perception. However, given their distinct training modalities, it remains unclear whether they correlate with the same neural aspects. We directly address this question by evaluating the brain prediction performance of two representative SSL models, Wav2Vec2.0 and GPT-2, designed for speech and language tasks. Our findings reveal that both models accurately predict speech responses in the auditory cortex, with a significant correlation between their brain predictions. Notably, shared speech contextual information between Wav2Vec2.0 and GPT-2 accounts for the majority of explained variance in brain activity, surpassing static semantic and lower-level acoustic-phonetic information. These results underscore the convergence of speech contextual representations in SSL models and their alignment with the neural network underlying speech perception, offering valuable insights into both SSL models and the neural basis of speech and language processing.

Submitted 31 January, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

Comments: To appear in 2024 IEEE International Conference on Acoustics, Speech and Signal Processing

arXiv:2310.02629 [pdf, other]

BA-MoE: Boundary-Aware Mixture-of-Experts Adapter for Code-Switching Speech Recognition

Authors: Peikun Chen, Fan Yu, Yuhao Lian, Hongfei Xue, Xucheng Wan, Naijun Zheng, Huan Zhou, Lei Xie

Abstract: Mixture-of-experts based models, which use language experts to extract language-specific representations effectively, have been well applied in code-switching automatic speech recognition. However, there is still substantial space to improve as similar pronunciation across languages may result in ineffective multi-language modeling and inaccurate language boundary estimation. To eliminate these drawbacks, we propose a cross-layer language adapter and a boundary-aware training method, namely Boundary-Aware Mixture-of-Experts (BA-MoE). Specifically, we introduce language-specific adapters to separate language-specific representations and a unified gating layer to fuse representations within each encoder layer. Second, we compute language adaptation loss of the mean output of each language-specific adapter to improve the adapter module's language-specific representation learning. Besides, we utilize a boundary-aware predictor to learn boundary representations for dealing with language boundary confusion. Our approach achieves significant performance improvement, reducing the mixture error rate by 16.55% compared to the baseline on the ASRU 2019 Mandarin-English code-switching challenge dataset.

Submitted 7 October, 2023; v1 submitted 4 October, 2023; originally announced October 2023.

Comments: Accepted by ASRU2023

arXiv:2309.16937 [pdf, other]

SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition

Authors: Hongfei Xue, Qijie Shao, Kaixun Huang, Peikun Chen, Jie Liu, Lei Xie

Abstract: Multilingual automatic speech recognition (ASR) systems have garnered attention for their potential to extend language coverage globally. While self-supervised learning (SSL) models, like MMS, have demonstrated their effectiveness in multilingual ASR, it is worth noting that various layers' representations potentially contain distinct information that has not been fully leveraged. In this study, we propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune the MMS model. We first analyze the different layers of MMS and show that the middle layers capture language-related information, and the high layers encode content-related information, which gradually decreases in the final layers. Then, we extract a language-related frame from correlated middle layers and guide specific language extraction through self-attention mechanisms. Additionally, we steer the model toward acquiring more content-related information in the final layers using our proposed Cross-CTC. We evaluate SSHR on two multilingual datasets, Common Voice and ML-SUPERB, and the experimental results demonstrate that our method achieves state-of-the-art performance.

Submitted 27 April, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

Comments: 5 pages, 2 figures. Accepted by ICME 2024

arXiv:2309.15701 [pdf, other]

HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, Eng Siong Chng

Abstract: Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. This approach is a paradigm shift from the traditional language model rescoring strategy that can only select one candidate hypothesis as the output transcription. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions across prevalent speech domains. Given this dataset, we examine three types of error correction techniques based on LLMs with varying amounts of labeled hypotheses-transcription pairs, which gains a significant word error rate (WER) reduction. Experimental evidence demonstrates the proposed technique achieves a breakthrough by surpassing the upper bound of traditional re-ranking based methods. More surprisingly, LLM with reasonable prompt and its generative capability can even correct those tokens that are missing in N-best list. We make our results publicly accessible for reproducible pipelines with released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs.

Submitted 16 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

Comments: Accepted to NeurIPS 2023, 24 pages. Datasets and Benchmarks Track. Added the first Mandarin and code-switching (zh-cn and en-us) results from the LLM-based generative ASR error correction to Table 8 on Page 21

arXiv:2309.13902 [pdf, other]

doi 10.1109/TVT.2023.3293189

NoncovANM: Gridless DOA Estimation for LPDF System

Authors: Yangying Zhao, Peng Chen, Zhenxin Cao, Xianbin Wang

Abstract: Direction of arrival (DOA) estimation is an important research topic in the area of array signal processing, and has been studied for decades. High resolution DOA estimation requires large array aperture, which leads to the increase of hardware cost. Besides, high accuracy DOA estimation methods usually have high computational complexity. In this paper, the problem of decreasing the hardware cost and algorithm complexity is addressed. First, considering its low cost and its ability to flexibly control electromagnetic waves, an intelligent reconfigurable surface (IRS)-aided low-cost passive direction finding (LPDF) system is developed, where only one fully functional receiving channel is adopted. Then, the sparsity of target directions in the spatial domain is exploited by formulating an atomic norm minimization (ANM) problem to estimate the DOA. Traditionally, solving the ANM problem is complex and cannot be realized efficiently. Hence, a novel nonconvex-based ANM (NC-ANM) method is proposed by gradient threshold iteration, where a perturbation is introduced to avoid falling into saddle points. The theoretical analysis for the convergence of the NC-ANM method is also given. Moreover, the corresponding Cramér-Rao lower bound (CRLB) in the LPDF system is derived, and taken as the referred bound of the DOA estimation. Simulation results show that the proposed method outperforms the compared methods in the DOA estimation with lower computational complexity in the LPDF system.

Submitted 25 September, 2023; originally announced September 2023.

Comments: 11 pages, 8 figures

Journal ref: IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, 2023

arXiv:2309.13856 [pdf, other]

doi 10.1109/TVT.2023.3319538

DNN-DANM: A High-Accuracy Two-Dimensional DOA Estimation Method Using Practical RIS

Authors: Zhimin Chen, Peng Chen, Le Zheng, Yudong Zhang

Abstract: Reconfigurable intelligent surface (RIS) or intelligent reflecting surface (IRS) has been an attractive technology for future wireless communication and sensing systems. However, in the practical RIS, the mutual coupling effect among RIS elements, the reflection phase shift, and amplitude errors will degrade the RIS performance significantly. This paper investigates the two-dimensional direction-of-arrival (DOA) estimation problem in the scenario using a practical RIS. After formulating the system model with the mutual coupling effect and the reflection phase/amplitude errors of the RIS, a novel DNN-DANM method is proposed for the DOA estimation by combining the deep neural network (DNN) and the decoupling atomic norm minimization (DANM). The DNN step reconstructs the received signal from the one with RIS impairments, and the DANM step exploits the signal sparsity in the two-dimensional spatial domain. Additionally, a semi-definite programming (SDP) method with low computational complexity is proposed to solve the atomic minimization problem. Finally, both simulation and prototype experiments are carried out to show estimation performance, and the proposed method outperforms the existing methods in the two-dimensional DOA estimation with low complexity in the scenario with practical RIS.

Submitted 24 September, 2023; originally announced September 2023.

Comments: 11 pages, 12 figures

Journal ref: IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, 2023

arXiv:2309.12596 [pdf, ps, other]

Movable Antenna-Empowered AirComp

Authors: Zhenqiao Cheng, Nanxi Li, Jianchi Zhu, Xiaoming She, Chongjun Ouyang, Peng Chen

Abstract: A novel over-the-air computation (AirComp) framework, empowered by the incorporation of movable antennas (MAs), is proposed to significantly enhance computation accuracy. Within this framework, the joint optimization of transmit power control, antenna positioning, and receive combining is investigated. An efficient method is proposed to tackle the problem of computation mean-squared error (MSE) minimization, capitalizing on the approach of alternating optimization. Numerical results are provided to substantiate the superior MSE performance of the proposed framework, which establish its clear advantage over benchmark systems employing conventional fixed-position antennas (FPAs).

Submitted 21 September, 2023; originally announced September 2023.

arXiv:2309.02259 [pdf, ps, other]

Design of a New CIM-DCSK-Based Ambient Backscatter Communication System

Authors: Ruipeng Yang, Yi Fang, **** Chen, Huan Ma

Abstract: To improve the data rate in differential chaos shift keying (DCSK) based ambient backscatter communication (AmBC) system, we propose a new AmBC system based on code index modulation (CIM), referred to as CIM-DCSK-AmBC system. In the proposed system, the CIM-DCSK signal transmitted in the direct link is used as the radio frequency source of the backscatter link. The signal format in the backscatter link is designed to increase the data rate as well as eliminate the interference of the direct link signal. As such, the direct link signal and the backscatter link signal can be received and demodulated simultaneously. Moreover, we derive and validate the theoretical bit error rate (BER) expressions of the CIM-DCSK-AmBC system over multipath Rayleigh fading channels. Regarding the short reference DCSK-based AmBC (SR-DCSK-AmBC) system as a benchmark system, numerical results reveal that the CIM-DCSK-AmBC system can achieve better BER performance in the direct link and higher throughput in the backscatter link than the benchmark system.

Submitted 5 September, 2023; originally announced September 2023.

arXiv:2309.01207 [pdf, other]

Spectral Adversarial MixUp for Few-Shot Unsupervised Domain Adaptation

Authors: Jia** Zhang, Hanqing Chao, Amit Dhurandhar, Pin-Yu Chen, Ali Tajer, Yangyang Xu, **kun Yan

Abstract: Domain shift is a common problem in clinical applications, where the training images (source domain) and the test images (target domain) are under different distributions. Unsupervised Domain Adaptation (UDA) techniques have been proposed to adapt models trained in the source domain to the target domain. However, those methods require a large number of images from the target domain for model training. In this paper, we propose a novel method for Few-Shot Unsupervised Domain Adaptation (FSUDA), where only a limited number of unlabeled target domain samples are available for training. To accomplish this challenging task, first, a spectral sensitivity map is introduced to characterize the generalization weaknesses of models in the frequency domain. We then developed a Sensitivity-guided Spectral Adversarial MixUp (SAMix) method to generate target-style images to effectively suppress the model sensitivity, which leads to improved model generalizability in the target domain. We demonstrated the proposed method and rigorously evaluated its performance on multiple tasks using several public datasets.

Submitted 3 September, 2023; originally announced September 2023.

Comments: Accepted by MICCAI 2023

arXiv:2307.14491 [pdf, other]

A Unified Framework for Modality-Agnostic Deepfakes Detection

Authors: Cai Yu, Peng Chen, Jiahe Tian, ** Liu, Jiao Dai, Xi Wang, Yesheng Chai, Shan Jia, Siwei Lyu, Jizhong Han

Abstract: As AI-generated content (AIGC) thrives, deepfakes have expanded from single-modality falsification to cross-modal fake content creation, where either audio or visual components can be manipulated. While using two unimodal detectors can detect audio-visual deepfakes, cross-modal forgery clues could be overlooked. Existing multimodal deepfake detection methods typically establish correspondence between the audio and visual modalities for binary real/fake classification, and require the co-occurrence of both modalities. However, in real-world multi-modal applications, missing modality scenarios may occur where either modality is unavailable. In such cases, audio-visual detection methods are less practical than two independent unimodal methods. Consequently, the detector cannot always obtain the number or type of manipulated modalities beforehand, necessitating a fake-modality-agnostic audio-visual detector. In this work, we introduce a comprehensive framework that is agnostic to fake modalities, which facilitates the identification of multimodal deepfakes and handles situations with missing modalities, regardless of the manipulations embedded in audio, video, or even cross-modal forms. To enhance the modeling of cross-modal forgery clues, we employ audio-visual speech recognition (AVSR) as a preliminary task. This efficiently extracts speech correlations across modalities, a feature challenging for deepfakes to replicate.
Additionally, we propose a dual-label detection approach that follows the structure of AVSR to support the independent detection of each modality. Extensive experiments on three audio-visual datasets show that our scheme outperforms state-of-the-art detection methods with promising performance on modality-agnostic audio/video deepfakes.

Submitted 24 October, 2023; v1 submitted 26 July, 2023; originally announced July 2023.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2307.10757 [pdf, other]

Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition

Authors: Weidong Chen, Xiaofen Xing, Peihao Chen, Xiangmin Xu

Abstract: This paper presents a paradigm that adapts general large-scale pretrained models (PTMs) to the speech emotion recognition task. Although PTMs shed new light on artificial general intelligence, they are constructed with general tasks in mind, and thus, their efficacy for specific tasks can be further improved. Additionally, employing PTMs in practical applications can be challenging due to their considerable size. These limitations spawn another research direction, namely, optimizing large-scale PTMs for specific tasks to generate task-specific PTMs that are both compact and effective. In this paper, we focus on the speech emotion recognition task and propose an improved emotion-specific pretrained encoder called Vesper. Vesper is pretrained on a speech dataset based on WavLM and takes into account emotional characteristics. To enhance sensitivity to emotional information, Vesper employs an emotion-guided masking strategy to identify the regions that need masking. Subsequently, Vesper employs hierarchical and cross-layer self-supervision to improve its ability to capture acoustic and semantic representations, both of which are crucial for emotion recognition. Experimental results on the IEMOCAP, MELD, and CREMA-D datasets demonstrate that Vesper with 4 layers outperforms WavLM Base with 12 layers, and the performance of Vesper with 12 layers surpasses that of WavLM Large with 24 layers.

Submitted 18 April, 2024; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: This paper was accepted by IEEE Transactions on Affective Computing 2024

arXiv:2307.04630 [pdf, other]

The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

Authors: Kun Song, Yi lei, Peikun Chen, Yiqing Cao, Kun Wei, Yongmao Zhang, Lei Xie, Ning Jiang, Guoqing Zhao

Abstract: This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task which aims to translate from English speech of multi-source to Chinese speech. The system is built in a cascaded manner consisting of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). We make tremendous efforts to handle the challenging multi-source input. Specifically, to improve the robustness to multi-source speech input, we adopt various data augmentation strategies and a ROVER-based score fusion on multiple ASR model outputs. To better handle the noisy ASR transcripts, we introduce a three-stage fine-tuning strategy to improve translation accuracy. Finally, we build a TTS model with high naturalness and sound quality, which leverages a two-stage framework, using network bottleneck features as a robust intermediate representation for speaker timbre and linguistic content disentanglement. Based on the two-stage framework, pre-trained speaker embedding is leveraged as a condition to transfer the speaker timbre in the source English speech to the translated Chinese speech. Experimental results show that our system has high translation accuracy, speech naturalness, sound quality, and speaker similarity. Moreover, it shows good robustness to multi-source data.

Submitted 10 July, 2023; originally announced July 2023.

Comments: IWSLT@ACL 2023 system paper. Our submitted system ranks 1st in the S2ST task of the IWSLT 2023 evaluation campaign

arXiv:2306.12925 [pdf, other]

AudioPaLM: A Large Language Model That Can Speak and Listen

Authors: Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, et al. (5 additional authors not shown)

Abstract: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples

Submitted 22 June, 2023; originally announced June 2023.

Comments: Technical report

arXiv:2305.13629 [pdf, other]

doi 10.21437/Interspeech.2023-746

TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition

Authors: Hongfei Xue, Qijie Shao, Peikun Chen, Pengcheng Guo, Lei Xie, Jie Liu

Abstract: UniSpeech has achieved superior performance in cross-lingual automatic speech recognition (ASR) by explicitly aligning latent representations to phoneme units using multi-task self-supervised learning. While the learned representations transfer well from high-resource to low-resource languages, predicting words directly from these phonetic representations in downstream ASR is challenging. In this paper, we propose TranUSR, a two-stage model comprising a pre-trained UniData2vec and a phoneme-to-word Transcoder. Different from UniSpeech, UniData2vec replaces the quantized discrete representations with continuous and contextual representations from a teacher model for phonetically-aware pre-training. Then, Transcoder learns to translate phonemes to words with the aid of extra texts, enabling direct word generation. Experiments on Common Voice show that UniData2vec reduces PER by 5.3% compared to UniSpeech, while Transcoder yields a 14.4% WER reduction compared to grapheme fine-tuning.

Submitted 8 October, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: 5 pages, 3 figures. Accepted by INTERSPEECH 2023

arXiv:2305.05203 [pdf, other]

Joint Multi-scale Cross-lingual Speaking Style Transfer with Bidirectional Attention Mechanism for Automatic Dubbing

Authors: **gbei Li, Sipan Li, ** Chen, Luwen Zhang, Yi Meng, Zhiyong Wu, Helen Meng, Qiao Tian, Yu** Wang, Yuxuan Wang

Abstract: Automatic dubbing, which generates a corresponding version of the input speech in another language, could be widely utilized in many real-world scenarios such as video and game localization. In addition to synthesizing the translated scripts, automatic dubbing needs to further transfer the speaking style in the original language to the dubbed speeches to give audiences the impression that the characters are speaking in their native tongue. However, state-of-the-art automatic dubbing systems only model the transfer on duration and speaking rate, neglecting the other aspects in speaking style such as emotion, intonation and emphasis which are also crucial to fully perform the characters and speech understanding. In this paper, we propose a joint multi-scale cross-lingual speaking style transfer framework to simultaneously model the bidirectional speaking style transfer between languages at both global (i.e. utterance level) and local (i.e. word level) scales. The global and local speaking styles in each language are extracted and utilized to predict the global and local speaking styles in the other language with an encoder-decoder framework for each direction and a shared bidirectional attention mechanism for both directions. A multi-scale speaking style enhanced FastSpeech 2 is then utilized to synthesize the predicted global and local speaking styles into speech for each language.
Experimental results demonstrate the effectiveness of our proposed framework, which outperforms a baseline with only duration transfer in both objective and subjective evaluations.

Submitted 9 May, 2023; originally announced May 2023.

Comments: Submitted to TASLP

arXiv:2305.03982 [pdf]

Pitch Estimation by Denoising Preprocessor and Hybrid Estimation Model

Authors: Yu Cheng Hung, ** Hung Chen, Jian Jiun Ding

Abstract: Pitch estimation, which estimates the fundamental frequency and the MIDI number, plays a critical role in music signal analysis and vocal signal processing. In this work, we proposed a new architecture based on a learning-based enhancement preprocessor and a combination of several traditional and deep learning pitch estimation methods to achieve better pitch estimation performance in both noisy and clean scenarios. We test 17 different types of noise and 4 SNRdb noise levels. The results show that the proposed pitch estimation can perform better in both noisy and clean scenarios with short response time.

Submitted 6 May, 2023; originally announced May 2023.

Comments: From ICCE-Taiwan

arXiv:2305.01309 [pdf, other]

doi 10.1109/TCSVT.2024.3379518

Geometric Prior Based Deep Human Point Cloud Geometry Compression

Authors: Xinju Wu, **** Zhang, Meng Wang, Peilin Chen, Shiqi Wang, Sam Kwong

Abstract: The emergence of digital avatars has raised an exponential increase in the demand for human point clouds with realistic and intricate details. The compression of such data becomes challenging with overwhelming data amounts comprising millions of points. Herein, we leverage the human geometric prior in geometry redundancy removal of point clouds, greatly promoting the compression performance. More specifically, the prior provides topological constraints as geometry initialization, allowing adaptive adjustments with a compact parameter set that could be represented with only a few bits. Therefore, we can envisage high-resolution human point clouds as a combination of geometric priors and structural deviations. The priors could first be derived with an aligned point cloud, and subsequently the difference of features is compressed into a compact latent code. The proposed framework can operate in a play-and-plug fashion with existing learning based point cloud compression methods. Extensive experimental results show that our approach significantly improves the compression performance without deteriorating the quality, demonstrating its promise in a variety of applications.

Submitted 25 March, 2024; v1 submitted 2 May, 2023; originally announced May 2023.

Comments: Accepted by TCSVT 2024

arXiv:2304.13928 [pdf, ps, other]

Cramer-Rao Lower Bound Analysis for OTFS and OFDM Modulation Systems

Authors: Bowen Wang, Jianchi Zhu, Xiaoming She, Peng Chen

Abstract: The orthogonal time frequency space (OTFS) modulation as a promising signal representation attracts growing interest for integrated sensing and communication (ISAC), yet its merits over orthogonal frequency division multiplexing (OFDM) remain controversial. This paper is devoted to a comprehensive comparison of OTFS and OFDM for sensing from the perspective of Cramer-Rao lower bounds (CRLB) analysis. To this end, we develop the cyclic prefix (CP)-Free and CP-added model for OFDM, while for OTFS, we consider the Zak transform based and the Two-Step conversion based models, respectively. Then we rephrase these four models into a unified matrix format to derive the CRLB of the delays and Doppler shifts for multipath scenario. Numerical results demonstrate the superiority of OTFS modulation for sensing, and the effect of physical parameters for performance achievement.

Submitted 26 April, 2023; originally announced April 2023.

arXiv:2304.03104 [pdf, other]

Constrained Exploration in Reinforcement Learning with Optimality Preservation

Authors: Peter C. Y. Chen

Abstract: We consider a class of reinforcement-learning systems in which the agent follows a behavior policy to explore a discrete state-action space to find an optimal policy while adhering to some restriction on its behavior. Such restriction may prevent the agent from visiting some state-action pairs, possibly leading to the agent finding only a sub-optimal policy. To address this problem we introduce the concept of constrained exploration with optimality preservation, whereby the exploration behavior of the agent is constrained to meet a specification while the optimality of the (original) unconstrained learning process is preserved. We first establish a feedback-control structure that models the dynamics of the unconstrained learning process. We then extend this structure by adding a supervisor to ensure that the behavior of the agent meets the specification, and establish (for a class of reinforcement-learning problems with a known deterministic environment) a necessary and sufficient condition under which optimality is preserved. This work demonstrates the utility and the prospect of studying reinforcement-learning problems in the context of the theories of discrete-event systems, automata and formal languages.

Submitted 5 April, 2023; originally announced April 2023.

Comments: 33 pages, and 6 figures

arXiv:2303.06341 [pdf, other]

The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge

Authors: Pengcheng Guo, He Wang, Bingshen Mu, Ao Zhang, Peikun Chen

Abstract: This paper describes our NPU-ASLP system for the Audio-Visual Diarization and Recognition (AVDR) task in the Multi-modal Information based Speech Processing (MISP) 2022 Challenge. Specifically, the weighted prediction error (WPE) and guided source separation (GSS) techniques are used to reduce reverberation and generate clean signals for each single speaker first. Then, we explore the effectiveness of Branchformer and E-Branchformer based ASR systems. To better make use of the visual modality, a cross-attention based multi-modal fusion module is proposed, which explicitly learns the contextual relationship between different modalities. Experiments show that our system achieves a concatenated minimum-permutation character error rate (cpCER) of 28.13% and 31.21% on the Dev and Eval set, and obtains second place in the challenge.

Submitted 11 March, 2023; originally announced March 2023.

Comments: 2 pages, accepted by ICASSP 2023

arXiv:2302.12662 [pdf, other]

FedDBL: Communication and Data Efficient Federated Deep-Broad Learning for Histopathological Tissue Classification

Authors: Tianpeng Deng, Yanqi Huang, Guoqiang Han, Zhenwei Shi, Jiatai Lin, Qi Dou, Zaiyi Liu, Xiao-**g Guo, C. L. Philip Chen, Chu Han

Abstract: Histopathological tissue classification is a fundamental task in computational pathology. Deep learning-based models have achieved superior performance but centralized training with data centralization suffers from the privacy leakage problem. Federated learning (FL) can safeguard privacy by keeping training samples locally, but existing FL-based frameworks require a large number of well-annotated training samples and numerous rounds of communication which hinder their practicability in the real-world clinical scenario. In this paper, we propose a universal and lightweight federated learning framework, named Federated Deep-Broad Learning (FedDBL), to achieve superior classification performance with limited training samples and only one-round communication. By simply associating a pre-trained deep learning feature extractor, a fast and lightweight broad learning inference system and a classical federated aggregation approach, FedDBL can dramatically reduce data dependency and improve communication efficiency. Five-fold cross-validation demonstrates that FedDBL greatly outperforms the competitors with only one-round communication and limited training samples, while it even achieves comparable performance with the ones under multiple-round communications. Furthermore, due to the lightweight design and one-round communication, FedDBL reduces the communication burden from 4.6GB to only 276.5KB per client using the ResNet-50 backbone at 50-round training.
Since no data or deep models are shared across different clients, the privacy issue is well-solved and the model security is guaranteed with no model inversion attack risk. Code is available at https://github.com/tianpeng-deng/FedDBL.

Submitted 17 December, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

arXiv:2302.02922 [pdf, other]

Joint Edge-Model Sparse Learning is Provably Efficient for Graph Neural Networks

Authors: Shuai Zhang, Meng Wang, Pin-Yu Chen, Sijia Liu, Songtao Lu, Miao Liu

Abstract: Due to the significant computational challenge of training large-scale graph neural networks (GNNs), various sparse learning techniques have been exploited to reduce memory and storage costs. Examples include graph sparsification that samples a subgraph to reduce the amount of data aggregation and model sparsification that prunes the neural network to reduce the number of trainable weights. Despite the empirical successes in reducing the training cost while maintaining the test accuracy, the theoretical generalization analysis of sparse learning for GNNs remains elusive. To the best of our knowledge, this paper provides the first theoretical characterization of joint edge-model sparse learning from the perspective of sample complexity and convergence rate in achieving zero generalization error. It proves analytically that both sampling important nodes and pruning neurons with the lowest-magnitude can reduce the sample complexity and improve convergence without compromising the test accuracy. Although the analysis is centered on two-layer GNNs with structural constraints on data, the insights are applicable to more general setups and justified by both synthetic and practical citation datasets.

Submitted 6 February, 2023; originally announced February 2023.

Journal ref: The Eleventh International Conference on Learning Representations, 2023

arXiv:2301.10606 [pdf, other]

A Holistic Cascade System, benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation

Authors: Wen-Chin Huang, Benjamin Peloquin, Justine Kao, Changhan Wang, Hongyu Gong, Elizabeth Salesky, Yossi Adi, Ann Lee, Peng-Jen Chen

Abstract: Expressive speech-to-speech translation (S2ST) aims to transfer prosodic attributes of source speech to target speech while maintaining translation accuracy. Existing research in expressive S2ST is limited, typically focusing on a single expressivity aspect at a time. Likewise, this research area lacks standard evaluation protocols and well-curated benchmark datasets. In this work, we propose a holistic cascade system for expressive S2ST, combining multiple prosody transfer techniques previously considered only in isolation. We curate a benchmark expressivity test set in the TV series domain and explored a second dataset in the audiobook domain. Finally, we present a human evaluation protocol to assess multiple expressive dimensions across speech pairs. Experimental results indicate that bi-lingual annotators can assess the quality of expressive preservation in S2ST systems, and the holistic modeling approach outperforms single-aspect systems. Audio samples can be accessed through our demo webpage: https://facebookresearch.github.io/speech_translation/cascade_expressive_s2st.

Submitted 25 January, 2023; originally announced January 2023.

Comments: This is the full version of our submission to ICASSP 2023

arXiv:2301.01915 [pdf, ps, other]

Sum-Rate Maximization in Active RIS-Assisted Multi-Antenna WPCN

Authors: Jie Jiang, Bin Lyu, Pengcheng Chen, Zhen Yang

Abstract: In this paper, we propose an active reconfigurable intelligent surface (RIS) enabled hybrid relaying scheme for a multi-antenna wireless powered communication network (WPCN), where the active RIS is employed to assist both wireless energy transfer (WET) from the power station (PS) to energy-constrained users and wireless information transmission (WIT) from users to the receiving station (RS). For further performance enhancement, we propose to employ both transmit beamforming at the PS and receive beamforming at the RS. We formulate a sum-rate maximization problem by jointly optimizing the RIS phase shifts and amplitude reflection coefficients for both the WET and the WIT, transmit and receive beamforming vectors, and network resource allocation. To solve this non-convex problem, we propose an efficient alternating optimization algorithm with linear minimum mean squared error criterion, semi-definite relaxation (SDR) and successive convex approximation techniques. Specifically, the tightness of applying the SDR is proved. Simulation results demonstrate that our proposed scheme with 10 reflecting elements (REs) and 4 antennas can achieve 17.78% and 415.48% performance gains compared to the single-antenna scheme with 10 REs and passive RIS scheme with 100 REs, respectively.

Submitted 5 January, 2023; originally announced January 2023.

Comments: Accepted by China Communications

arXiv:2212.08055 [pdf, other]

UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

Authors: Hirofumi Inaguma, Sravya Popuri, Ilia Kulikov, Peng-Jen Chen, Changhan Wang, Yu-An Chung, Yun Tang, Ann Lee, Shinji Watanabe, Juan Pino

Abstract: Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.

Submitted 26 May, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

Comments: ACL 2023 (main conference)

arXiv:2212.04069 [pdf, other]

Reinforcement Learning for Resilient Power Grids

Authors: Zhenting Zhao, Po-Yen Chen, Yucheng **

Abstract: Traditional power grid systems have become obsolete under more frequent and extreme natural disasters. Reinforcement learning (RL) has been a promising solution for resilience given its successful history of power grid control. However, most power grid simulators and RL interfaces do not support simulation of power grid under large-scale blackouts or when the network is divided into sub-networks. In this study, we proposed an updated power grid simulator built on Grid2Op, an existing simulator and RL interface, and experimented on limiting the action and observation spaces of Grid2Op. By testing with DDQN and SliceRDQN algorithms, we found that reduced action spaces significantly improve training performance and efficiency. In addition, we investigated a low-rank neural network regularization method for deep Q-learning, one of the most widely used RL algorithms, in this power grid control scenario. As a result, the experiment demonstrated that in the power grid simulation environment, adopting this method will significantly increase the performance of RL agents.

Submitted 7 December, 2022; originally announced December 2022.

Comments: 7 pages, 3 figures, 6 tables
