Search | arXiv e-print repository

Coded Beam Training for RIS Assisted Wireless Communications

Abstract: Reconfigurable intelligent surface (RIS) is considered one of the key technologies for future 6G communications. To fully unleash the performance of RIS, accurate channel state information (CSI) is crucial. Beam training is widely utilized to acquire the CSI. However, before aligning the beam correctly to establish stable connections, the signal-to-noise ratio (SNR) at the user equipment (UE) is inevitably low, which reduces the beam training accuracy. To deal with this problem, we exploit the coded beam training framework for RIS systems, which leverages the error correction capability of channel coding to improve the beam training accuracy under low SNR. Specifically, we first extend the coded beam training framework to RIS systems by decoupling the base station-RIS channel and the RIS-user channel. For this framework, codewords that accurately steer to multiple angles are essential for fully unleashing the error correction capability. In order to realize effective codeword design in RIS systems, we then propose a new codeword design criterion, based on which we propose a relaxed Gerchberg-Saxton (GS) based codeword design scheme by considering the constant modulus constraints of RIS elements. In addition, considering the two dimensional structure of RIS, we further propose a dimension reduced encoder design scheme, which can not only guarantee a better beam shape, but also enable a stronger error correction capability. Simulation results reveal that the proposed scheme can realize effective and accurate beam training in low SNR scenarios.

Submitted 22 June, 2024; originally announced June 2024.

Comments: In this paper, we exploit the coded beam training framework in RIS systems. By applying the idea of channel coding in the beam training process, we can leverage the error correction capability of channel coding to enhance the reliability of beam training under low SNR. Simulation codes will be provided at: http://oa.ee.tsinghua.edu.cn/dailinglong/publications/publications.html

arXiv:2406.12270 [pdf, other]

Sparse MIMO for ISAC: New Opportunities and Challenges

Authors: Xinrui Li, Hongqi Min, Yong Zeng, Shi **, Linglong Dai, Yifei Yuan, Rui Zhang

Abstract: Multiple-input multiple-output (MIMO) has been a key technology of wireless communications for decades. A typical MIMO system employs antenna arrays with the inter-antenna spacing being half of the signal wavelength, which we term compact MIMO. Looking forward towards the future sixth-generation (6G) mobile communication networks, MIMO systems will achieve even finer spatial resolution to not only enhance the spectral efficiency of wireless communications, but also enable more accurate wireless sensing. To this end, by removing the restriction of half-wavelength antenna spacing, sparse MIMO has been proposed as a new architecture that is able to significantly enlarge the array aperture as compared to conventional compact MIMO with the same number of array elements. In addition, sparse MIMO leads to a new form of virtual MIMO systems for sensing with their virtual apertures considerably larger than physical apertures. As sparse MIMO is expected to be a viable technology for 6G, we provide in this article a comprehensive overview of it, especially focusing on its appealing advantages for integrated sensing and communication (ISAC) towards 6G. Specifically, assorted sparse MIMO architectures are first introduced, followed by their new benefits as well as challenges. We then discuss the main design issues of sparse MIMO, including beam pattern synthesis, signal processing, grating lobe suppression, beam codebook design, and array geometry optimization. Last, we provide numerical results to evaluate the performance of sparse MIMO for ISAC and point out promising directions for future research.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.07989 [pdf, other]

Near-Field Wideband Beam Training Based on Distance-Dependent Beam Split

Authors: Tianyue Zheng, Mingyao Cui, Zidong Wu, Linglong Dai

Abstract: Near-field beam training is essential for acquiring channel state information in 6G extremely large-scale multiple-input multiple-output (XL-MIMO) systems. To achieve low-overhead beam training, an existing method leverages the near-field beam split effect, deploying true-time-delay arrays to simultaneously search multiple angles of the entire angular range in a distance ring with a single pilot. However, the method still requires exhaustive search in the distance domain, which limits its efficiency. To address the problem, we propose a distance-dependent beam-split-based beam training method to further reduce the training overhead. Specifically, we first reveal the new phenomenon of distance-dependent beam split, where by manipulating the configurations of time-delay and phase-shift, beams at different frequencies can simultaneously scan the angular domain in multiple distance rings. Leveraging the phenomenon, we propose a near-field beam training method where both different angles and distances can simultaneously be searched in one time slot. Thus, a few pilots are capable of covering the whole angle-distance space for wideband XL-MIMO. Theoretical analysis and numerical simulations are also presented to verify the superiority of the proposed method in beamforming gain and training overhead.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.05325 [pdf, other]

LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance

Authors: Shihao Chen, Yu Gu, Jie Zhang, Na Li, Rilin Chen, Li** Chen, Lirong Dai

Abstract: Any-to-any singing voice conversion (SVC) is an interesting audio editing technique, aiming to convert the singing voice of one singer into that of another, given only a few seconds of singing data. However, during the conversion process, the issue of timbre leakage is inevitable: the converted singing voice still sounds like the original singer's voice. To tackle this, we propose a latent diffusion model for SVC (LDM-SVC) in this work, which attempts to perform SVC in the latent space using an LDM. We pretrain a variational autoencoder structure using the noted open-source So-VITS-SVC project based on the VITS framework, which is then used for the LDM training. Besides, we propose a singer guidance training method based on classifier-free guidance to further suppress the timbre of the original singer. Experimental results show the superiority of the proposed method over previous works in both subjective and objective evaluations of timbre similarity.

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.04881 [pdf, other]

MIMO Capacity Analysis and Channel Estimation for Electromagnetic Information Theory

Authors: Jieao Zhu, Vincent Y. F. Tan, Linglong Dai

Abstract: Electromagnetic information theory (EIT) is an interdisciplinary subject that serves to integrate deterministic electromagnetic theory with stochastic Shannon's information theory. Existing EIT analysis operates in the continuous space domain, which is not aligned with the practical algorithms working in the discrete space domain. This mismatch leads to a significant difficulty in applying EIT methodologies to practical discrete space systems, which is called the discrete-continuous gap in this paper. To bridge this gap, we establish the discrete-continuous correspondence with a prolate spheroidal wave function (PSWF)-based ergodic capacity analysis framework. Specifically, we state and prove some discrete-continuous correspondence lemmas to establish a firm theoretical connection between discrete information-theoretic quantities and their continuous counterparts. With these lemmas, we apply the PSWF ergodic capacity bound to advanced MIMO architectures such as continuous-aperture MIMO (CAP-MIMO) and extremely large-scale MIMO (XL-MIMO). From this PSWF capacity bound, we discover the capacity saturation phenomenon both theoretically and empirically. Although the growth of MIMO performance is fundamentally limited in this EIT-based analysis framework, we reveal new opportunities in MIMO channel estimation by exploiting the EIT knowledge about the channel. Inspired by the PSWF capacity bound, we utilize continuous PSWFs to improve the pilot design of discrete MIMO channel estimators, which is called the PSWF channel estimator (PSWF-CE). Simulation results demonstrate improved performance of the proposed PSWF-CE, compared to traditional minimum mean squared error (MMSE) and compressed sensing-based estimators.

Submitted 7 June, 2024; originally announced June 2024.

Comments: Submitted to the IEEE TWC. In this paper, we established the discrete-continuous correspondence for electromagnetic information theory (EIT), thus enabling analytical tools in the continuous space domain to be applied to discrete space MIMO architectures. Simulation codes will be provided at http://oa.ee.tsinghua.edu.cn/dailinglong/publications/publications.html

arXiv:2405.11352 [pdf, other]

Hierarchical Reinforcement Learning Empowered Task Offloading in V2I Networks

Authors: Xinyu You, Haojie Yan, Yuedong Xu, Lifeng Wang, Liangui Dai

Abstract: Edge computing plays an essential role in vehicle-to-infrastructure (V2I) networks, where vehicles offload their intensive computation tasks to the road-side units to save energy and reduce latency. This paper designs the optimal task offloading policy to address the concerns involving processing delay, energy consumption and edge computing cost. Each computation task consisting of some interdependent sub-tasks is characterized as a directed acyclic graph (DAG). In such dynamic networks, a novel hierarchical offloading scheme is proposed by leveraging deep reinforcement learning (DRL). The inter-dependencies among the DAGs of the computation tasks are extracted using a graph neural network with an attention mechanism. A parameterized DRL algorithm is developed to deal with the hierarchical action space containing both discrete and continuous actions. Simulation results with a real-world car speed dataset demonstrate that the proposed scheme can effectively reduce the system overhead.

Submitted 18 May, 2024; originally announced May 2024.

arXiv:2405.10496 [pdf, other]

Electromagnetic Information Theory for Holographic MIMO Communications

Authors: Li Wei, Tierui Gong, Chongwen Huang, Zhaoyang Zhang, Wei E. I. Sha, Zhi Ning Chen, Linglong Dai, Merouane Debbah, Chau Yuen

Abstract: Holographic multiple-input multiple-output (HMIMO) utilizes a compact antenna array to form a nearly continuous aperture, thereby enabling higher capacity and more flexible configurations compared with conventional MIMO systems, making it attractive in current scientific research. Key questions naturally arise regarding the potential of HMIMO to surpass Shannon's theoretical limits and how far its capabilities can be extended. However, the traditional Shannon information theory falls short in addressing these inquiries because it only focuses on the information itself while neglecting the underlying carrier, electromagnetic (EM) waves, and environmental interactions. To fill the gap between the theoretical analysis and the practical application for HMIMO systems, we introduce electromagnetic information theory (EIT) in this paper. This paper begins by laying the foundation for HMIMO-oriented EIT, encompassing EM wave equations and communication regions. In the context of HMIMO systems, the resultant physical limitations are presented, involving Chu's limit, Harrington's limit, Hannan's limit, and the evaluation of coupling effects. Field sampling and HMIMO-assisted oversampling are also discussed to guide the optimal HMIMO design within the EIT framework. To comprehensively depict the EM-compliant propagation process, we present the approximate and exact channel modeling approaches in near-/far-field zones. Furthermore, we discuss both traditional Shannon's information theory, employing the probabilistic method, and Kolmogorov information theory, utilizing the functional analysis, for HMIMO-oriented EIT systems.

Submitted 25 May, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

arXiv:2404.06806 [pdf, other]

Near-Optimal Channel Estimation for Dense Array Systems

Authors: Mingyao Cui, Zijian Zhang, Linglong Dai, Kaibin Huang

Abstract: By deploying a large number of antennas with sub-half-wavelength spacing in a compact space, dense array systems (DASs) can fully unleash the multiplexing-and-diversity gains of limited apertures. To acquire these gains, accurate channel state information acquisition is necessary but challenging due to the large number of antennas. To overcome this obstacle, this paper reveals that exploiting the high spatial correlation of DAS channels is crucial while designing the observation matrix for optimal/near-optimal channel estimation. Firstly, we prove that the observation matrix design is equivalent to a time-domain duality of multiple-input multiple-output precoding, which can be ideally addressed by the water-filling principle. For practical realizations, a novel ice-filling algorithm is proposed to design amplitude-and-phase controllable observation matrices, and a majorization-minimization algorithm is proposed to address the phase-only controllable case. Particularly, we prove that the ice-filling algorithm can be viewed as a "quantized" water-filling algorithm. To support the sub-optimality of the proposed designs, we provide comprehensive analyses on the achievable mean square errors and their asymptotic expressions. Finally, numerical simulations verify that our proposed channel estimation designs can achieve the near-optimal performance and outperform existing approaches significantly.

Submitted 10 April, 2024; originally announced April 2024.

Comments: 19 pages, 10 figures

arXiv:2403.17770 [pdf, other]

CT Synthesis with Conditional Diffusion Models for Abdominal Lymph Node Segmentation

Authors: Yongrui Yu, Hanyu Chen, Zitian Zhang, Qiong Xiao, Wenhui Lei, Linrui Dai, Yu Fu, Hui Tan, Guan Wang, Peng Gao, Xiaofan Zhang

Abstract: Despite the significant success achieved by deep learning methods in medical image segmentation, researchers still struggle in the computer-aided diagnosis of abdominal lymph nodes due to the complex abdominal environment, small and indistinguishable lesions, and limited annotated data. To address these problems, we present a pipeline that integrates the conditional diffusion model for lymph node generation and the nnU-Net model for lymph node segmentation to improve the segmentation performance of abdominal lymph nodes through synthesizing a diversity of realistic abdominal lymph node data. We propose LN-DDPM, a conditional denoising diffusion probabilistic model (DDPM) for lymph node (LN) generation. LN-DDPM utilizes lymph node masks and anatomical structure masks as model conditions. These conditions work in two conditioning mechanisms: global structure conditioning and local detail conditioning, to distinguish between lymph nodes and their surroundings and better capture lymph node characteristics. The obtained paired abdominal lymph node images and masks are used for the downstream segmentation task. Experimental results on the abdominal lymph node datasets demonstrate that LN-DDPM outperforms other generative methods in the abdominal lymph node image synthesis and better assists the downstream abdominal lymph node segmentation task.

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.16062 [pdf]

Holography inspired self-controlled reconfigurable intelligent surface

Authors: Jieao Zhu, Ze Gu, Qian Ma, Linglong Dai, Tie Jun Cui

Abstract: Among various promising candidate technologies for the sixth-generation (6G) wireless communications, recent advances in microwave metasurfaces have sparked a new research area of reconfigurable intelligent surfaces (RISs). By controllably reprogramming the wireless propagation channel, RISs are envisioned to achieve low-cost wireless capacity boosting, coverage extension, and enhanced energy efficiency. To reprogram the channel, each meta-atom on the RIS needs an external control signal, which is usually generated by the base station (BS). However, BS-controlled RISs require complicated control cables, which hamper their massive deployment. Here, we eliminate the need for BS control by proposing a self-controlled RIS (SC-RIS), which is inspired by the optical holography principle. Different from the existing BS-controlled RISs, each meta-atom of SC-RIS is integrated with an additional power detector for holographic recording. By applying the classical Fourier-transform processing to the measured hologram, SC-RIS is capable of retrieving the user's channel state information required for beamforming, thus enabling autonomous RIS beamforming without control cables. Owing to this WiFi-like plug-and-play capability without the BS control, SC-RISs are expected to enable easy and massive deployments in the future 6G systems.

Submitted 24 March, 2024; originally announced March 2024.

Comments: Traditional BS-controlled RISs suffer from complicated control cables. To "cut" the control cables, we propose a self-controlled RIS by leveraging the holographic interference principle, thus realizing autonomous RIS beamforming

arXiv:2403.12268 [pdf, other]

Near-Field Channel Modeling for Electromagnetic Information Theory

Authors: Zhongzhichao Wan, Jieao Zhu, Linglong Dai

Abstract: Electromagnetic information theory (EIT) is one of the emerging topics for 6G communication due to its potential to reveal the performance limit of wireless communication systems. For EIT, the research foundation is reasonable and accurate channel modeling. Existing channel modeling works for EIT in non-line-of-sight (NLoS) scenarios focus on far-field modeling, which cannot accurately capture the characteristics of the channel in the near field. In this paper, we propose the near-field channel model for EIT based on electromagnetic scattering theory. We model the channel by using non-stationary Gaussian random fields and derive the analytical expression of the correlation function of the fields. Furthermore, we analyze the characteristics of the proposed channel model, e.g., channel degrees of freedom (DoF). Finally, we design a channel estimation scheme for near-field scenarios by integrating the electromagnetic prior information of the proposed model. Numerical analysis verifies the correctness of the proposed scheme and shows that it can outperform existing schemes like least square (LS) and orthogonal matching pursuit (OMP).

Submitted 26 May, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

Comments: In this paper, we propose the near-field channel model for EIT based on electromagnetic scattering theory. Then, we derive the analytical expression of the correlation function of the fields and analyze the characteristics of it. Finally, we design a channel estimation scheme for near-field scenario

arXiv:2403.07247 [pdf, other]

GuideGen: A Text-guided Framework for Joint CT Volume and Anatomical structure Generation

Authors: Linrui Dai, Rongzhao Zhang, Zhongzhen Huang, Xiaofan Zhang

Abstract: The annotation burden and extensive labor for gathering a large medical dataset with images and corresponding labels are rarely cost-effective and highly intimidating. This results in a lack of abundant training data that undermines downstream tasks and partially contributes to the challenge image analysis faces in the medical field. As a workaround, given the recent success of generative neural models, it is now possible to synthesize image datasets at a high fidelity guided by external constraints. This paper explores this possibility and presents GuideGen: a pipeline that jointly generates CT images and tissue masks for abdominal organs and colorectal cancer conditioned on a text prompt. Firstly, we introduce Volumetric Mask Sampler to fit the discrete distribution of mask labels and generate low-resolution 3D tissue masks. Secondly, our Conditional Image Generator autoregressively generates CT slices conditioned on a corresponding mask slice to incorporate both style information and anatomical guidance. This pipeline guarantees high fidelity and variability as well as exact alignment between generated CT volumes and tissue masks. Both qualitative and quantitative experiments on 3D abdominal CTs demonstrate a high performance of our proposed pipeline, thereby proving our method can serve as a dataset generator and provide potential benefits to downstream tasks. It is hoped that our work will offer a promising solution on the multimodality generation of CT and its anatomical mask. Our source code is publicly available at https://github.com/OvO1111/JointImageGeneration.

Submitted 11 March, 2024; originally announced March 2024.

Comments: submitted to MICCAI2024

arXiv:2403.05970 [pdf, other]

Electromagnetic Hybrid Beamforming for Holographic Communications

Authors: Ran Ji, Chongwen Huang, Xiaoming Chen, Wei E. I. Sha, Linglong Dai, Jiguang He, Zhaoyang Zhang, Chau Yuen, Mérouane Debbah

Abstract: It is well known that there is inherent radiation pattern distortion for the commercial base station antenna array, which usually needs three antenna sectors to cover the whole space. To eliminate pattern distortion and further enhance beamforming performance, we propose an electromagnetic hybrid beamforming (EHB) scheme based on a three-dimensional (3D) superdirective holographic antenna array. Specifically, EHB consists of antenna excitation current vectors (analog beamforming) and digital precoding matrices, where the implementation of analog beamforming involves the real-time adjustment of the radiation pattern to adapt it to the dynamic wireless environment. Meanwhile, the digital beamforming is optimized based on the channel characteristics of analog beamforming to further improve the achievable rate of communication systems. An electromagnetic channel model incorporating array radiation patterns and the mutual coupling effect is also developed to evaluate the benefits of our proposed scheme. Simulation results demonstrate that our proposed EHB scheme with a 3D holographic array achieves a relatively flat superdirective beamforming gain and allows for programmable focusing directions throughout the entire spatial domain. Furthermore, they also verify that the proposed scheme achieves a sum rate gain of over 150% compared to traditional beamforming algorithms.

Submitted 9 March, 2024; originally announced March 2024.

Comments: 13 pages

arXiv:2402.02688 [pdf, ps, other]

Successive Bayesian Reconstructor for FAS Channel Estimation

Authors: Zijian Zhang, Jieao Zhu, Linglong Dai, Robert W. Heath Jr

Abstract: Fluid antenna systems (FASs) can reconfigure their locations freely within a spatially continuous space. To keep favorable antenna positions, the channel state information (CSI) acquisition for FASs is essential. While some techniques have been proposed, most existing FAS channel estimators require several channel assumptions, such as slow variation and angular-domain sparsity. When these assumptions are not reasonable, the model mismatch may lead to unpredictable performance loss. In this paper, we propose the successive Bayesian reconstructor (S-BAR) as a general solution to estimate FAS channels. Unlike model-based estimators, the proposed S-BAR is prior-aided, which builds the experiential kernel for CSI acquisition. Inspired by Bayesian regression, the key idea of S-BAR is to model the FAS channels as a stochastic process, whose uncertainty can be successively eliminated by kernel-based sampling and regression. In this way, the predictive mean of the regressed stochastic process can be viewed as the maximum a posteriori (MAP) estimator of FAS channels. Simulation results verify that, in both model-mismatched and model-matched cases, the proposed S-BAR can achieve higher estimation accuracy than the existing schemes.

Submitted 4 February, 2024; originally announced February 2024.

Comments: Accepted by IEEE WCNC 2024. This paper proposes S-BAR as a general solution to estimate FAS channels. More insights can be found in the journal version of this paper: arXiv:2312.06551. arXiv admin note: substantial text overlap with arXiv:2312.06551

arXiv:2401.11857 [pdf, other]

doi 10.1109/ICASSP48485.2024.10447699

Adversarial speech for voice privacy protection from Personalized Speech generation

Authors: Shihao Chen, Li** Chen, Jie Zhang, KongAik Lee, Zhenhua Ling, Lirong Dai

Abstract: The rapid progress in personalized speech generation technology, including personalized text-to-speech (TTS) and voice conversion (VC), poses a challenge in distinguishing between generated and real speech for human listeners, resulting in an urgent demand for protecting speakers' voices from malicious misuse. In this regard, we propose a speaker protection method based on adversarial attacks. The proposed method perturbs speech signals by minimally altering the original speech while rendering downstream speech generation models unable to accurately generate the voice of the target speaker. For validation, we employ the open-source pre-trained YourTTS model for speech generation and protect the target speaker's speech in the white-box scenario. Automatic speaker verification (ASV) evaluations were carried out on the generated speech as the assessment of the voice protection capability. Our experimental results show that we successfully perturbed the speaker encoder of the YourTTS model using the gradient-based I-FGSM adversarial perturbation method. Furthermore, the adversarial perturbation is effective in preventing the YourTTS model from generating the speech of the target speaker. Audio samples can be found at https://voiceprivacy.github.io/Adeversarial-Speech-with-YourTTS.

Submitted 22 January, 2024; originally announced January 2024.

Comments: Accepted by ICASSP 2024
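
The abstract above names the iterative fast gradient sign method (I-FGSM). As a rough illustration only, the update rule can be sketched on a toy differentiable loss. The linear loss, the function name `ifgsm`, and all parameter values are assumptions made for this sketch; none of them come from the paper or from YourTTS.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def ifgsm(x, grad_fn, eps, alpha, steps):
    """Repeatedly step along the sign of the loss gradient while keeping
    the perturbation inside an L-infinity ball of radius eps around x."""
    x_adv = list(x)
    for _ in range(steps):
        g = grad_fn(x_adv)
        for i in range(len(x_adv)):
            stepped = x_adv[i] + alpha * sign(g[i])
            # Project back so that |x_adv[i] - x[i]| <= eps.
            x_adv[i] = min(max(stepped, x[i] - eps), x[i] + eps)
    return x_adv

# Toy loss L(x) = w . x, whose gradient is the constant vector w (assumption).
w = [0.5, -2.0, 0.0]
x = [0.1, 0.2, 0.3]
x_adv = ifgsm(x, lambda _: w, eps=0.01, alpha=0.004, steps=5)
```

In a real attack the gradient would come from the loss of the model under attack, and whether one ascends or descends that loss depends on the objective. The clipping step is what keeps the change to the signal small.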

arXiv:2401.03468 [pdf, other]

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Authors: Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai

Abstract: Self-supervised speech pre-training methods have developed rapidly in recent years and have been shown to be very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing suffers from the scarcity of labeled multichannel data and complex ambient noises. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose a multichannel multi-modal speech self-supervised learning framework AV-wav2vec2, which utilizes video and multichannel audio data as inputs. First, we propose a multi-path structure to process multichannel audio streams and a visual stream in parallel, with intra- and inter-channel contrastive losses as training targets to fully exploit the spatiotemporal information in multichannel speech data. Second, based on contrastive learning, we use additional single-channel audio data, which is trained jointly to improve the performance of speech representation. Finally, we use a Chinese multichannel multi-modal dataset in real scenarios to validate the effectiveness of the proposed method on audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR) and audio-visual speaker diarization (AVSD) tasks.

Submitted 7 January, 2024; originally announced January 2024.

Comments: Accepted by AAAI 2024

arXiv:2401.01673 [pdf, other]

Coded Beam Training

Authors: Tianyue Zheng, Jieao Zhu, Qiumo Yu, Yongli Yan, Linglong Dai

Abstract: In extremely large-scale multiple input multiple output (XL-MIMO) systems for future sixth-generation (6G) communications, codebook-based beam training stands out as a promising technology to acquire channel state information (CSI). Despite their effectiveness, when the pilot overhead is limited, existing beam training methods suffer from significant achievable rate degradation for remote users with low signal-to-noise ratio (SNR). To tackle this challenge, leveraging the error-correcting capability of channel codes, we introduce channel coding theory into hierarchical beam training to extend the coverage area. Specifically, we establish the duality between hierarchical beam training and channel coding, and the proposed coded beam training scheme serves as a general framework. Then, we present two specific implementations exemplified by coded beam training methods based on Hamming codes and convolutional codes, during which the beam encoding and decoding processes are refined respectively to better accommodate the beam training problem. Simulation results have demonstrated that the proposed coded beam training method can enable reliable beam training performance for remote users with low SNR while keeping training overhead low.

Submitted 6 March, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

Comments: In this paper, we introduce channel coding theory into hierarchical beam training and propose a beam training scheme called coded beam training. By leveraging the error-correcting capability of channel codes, the proposed coded beam training method can enable reliable beam training performance for remote users with low SNR, while keeping training overhead low
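
To illustrate the error-correcting idea this entry relies on, here is a minimal sketch of a Hamming(7,4) code applied to a 4-bit beam index. Treating each coded bit as one noisy binary decision made in a training round is an assumption of this sketch; the paper's actual beam encoding and decoding procedures are not reproduced here.

```python
# Systematic Hamming(7,4): G = [I | P], H = [P^T | I], arithmetic mod 2.
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def encode(bits):
    """Map 4 message bits (a hypothetical beam index) to a 7-bit codeword."""
    return [sum(b * g for b, g in zip(bits, col)) % 2 for col in zip(*G)]

def decode(received):
    """Correct at most one flipped bit via the syndrome, then return the
    4 message bits (the first four positions of the systematic code)."""
    word = list(received)
    syndrome = [sum(h * r for h, r in zip(row, word)) % 2 for row in H]
    if any(syndrome):
        # A single error at position j yields a syndrome equal to column j of H.
        columns = [list(col) for col in zip(*H)]
        word[columns.index(syndrome)] ^= 1
    return word[:4]

beam_index_bits = [1, 0, 1, 1]
codeword = encode(beam_index_bits)
noisy = list(codeword)
noisy[2] ^= 1  # one wrong decision, e.g. caused by low SNR
recovered = decode(noisy)
```

With seven binary decisions instead of four, any single wrong decision still leads back to the intended index; two or more errors would exceed what this particular code can correct.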

arXiv:2312.06551 [pdf, ps, other]

Successive Bayesian Reconstructor for Channel Estimation in Fluid Antenna Systems

Authors: Zijian Zhang, Jieao Zhu, Linglong Dai, Robert W. Heath Jr

Abstract: Fluid antenna systems (FASs) can reconfigure their antenna locations freely within a spatially continuous space. To keep favorable antenna positions, the channel state information (CSI) acquisition for FASs is essential. While some techniques have been proposed, most existing FAS channel estimators require several channel assumptions, such as slow variation and angular-domain sparsity. When these assumptions are not reasonable, the model mismatch may lead to unpredictable performance loss. In this paper, we propose the successive Bayesian reconstructor (S-BAR) as a general solution to estimate FAS channels. Unlike model-based estimators, the proposed S-BAR is prior-aided, which builds the experiential kernel for CSI acquisition. Inspired by Bayesian regression, the key idea of S-BAR is to model the FAS channels as a stochastic process, whose uncertainty can be successively eliminated by kernel-based sampling and regression. In this way, the predictive mean of the regressed stochastic process can be viewed as the maximum a posteriori (MAP) estimator of FAS channels. Simulation results verify that, in both model-mismatched and model-matched cases, the proposed S-BAR can achieve higher estimation accuracy than the existing schemes.

Submitted 17 January, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

Comments: 13 pages, 8 figures. This paper proposes S-BAR as a general solution to estimate FAS channels. Unlike model-based estimators, the proposed S-BAR is prior-aided, which builds the experiential kernel for CSI acquisition. Simulation codes will be provided at: http://oa.ee.tsinghua.edu.cn/dailinglong/publications/publications.html

arXiv:2311.08024 [pdf, other]

MD-IQA: Learning Multi-scale Distributed Image Quality Assessment with Semi Supervised Learning for Low Dose CT

Authors: Tao Song, Ruizhi Hou, Lisong Dai, Lei Xiang

Abstract: Image quality assessment (IQA) plays a critical role in optimizing radiation dose and developing novel medical imaging techniques in computed tomography (CT). Traditional IQA methods relying on hand-crafted features have limitations in summarizing the subjective perceptual experience of image quality. Recent deep learning-based approaches have demonstrated strong modeling capabilities and potential for medical IQA, but challenges remain regarding model generalization and perceptual accuracy. In this work, we propose a multi-scale distributions regression approach to predict quality scores by constraining the output distribution, thereby improving model generalization. Furthermore, we design a dual-branch alignment network to enhance feature extraction capabilities. Additionally, semi-supervised learning is introduced by utilizing pseudo-labels for unlabeled data to guide model training. Extensive qualitative experiments demonstrate the effectiveness of our proposed method for advancing the state-of-the-art in deep learning-based medical IQA. Code is available at: https://github.com/zunzhumu/MD-IQA.

Submitted 14 November, 2023; originally announced November 2023.

arXiv:2310.15901 [pdf, other]

Enhancing Energy Efficiency for Reconfigurable Intelligent Surfaces with Practical Power Models

Authors: Zhiyi Li, Jida Zhang, Jieao Zhu, Shi Jin, Linglong Dai

Abstract: Reconfigurable intelligent surfaces (RISs) are widely considered a promising technology for future wireless communication systems. As an important indicator of RIS-assisted communication systems in green wireless communications, energy efficiency (EE) has recently received intensive research interest as an optimization target. However, most previous works have ignored the different power consumption between ON and OFF states of the PIN diodes attached to each RIS element. This oversight results in extensive unnecessary power consumption and reduction of actual EE due to the inaccurate power model. To address this issue, in this paper, we first utilize a practical power model for a RIS-assisted multi-user multiple-input single-output (MU-MISO) communication system, which takes into account the difference in power dissipation caused by ON-OFF states of RIS's PIN diodes. Based on this model, we formulate a more accurate EE optimization problem. However, this problem is non-convex and has mixed-integer properties, which poses a challenge for optimization. To solve the problem, an effective alternating optimization (AO) algorithm framework is utilized to optimize the base station and RIS beamforming precoder separately. To obtain the essential RIS beamforming precoder, we develop two effective methods based on maximum gradient search and SDP relaxation respectively. Theoretical analysis shows the exponential complexity of the original problem has been reduced to polynomial complexity. Simulation results demonstrate that the proposed algorithm outperforms the existing ones, leading to a significant increase in EE across a diverse set of scenarios.

Submitted 24 October, 2023; originally announced October 2023.

Comments: Reconfigurable intelligent surface is a promising 6G technology. However, RIS power models are inaccurate. In this paper, we construct a practical power model for RIS communication systems with an SDP-relaxation algorithm, achieving optimal energy efficiency

arXiv:2310.12446 [pdf, other]

Can Electromagnetic Information Theory Improve Wireless Systems? A Channel Estimation Example

Authors: Jieao Zhu, Zhongzhichao Wan, Linglong Dai, Tie Jun Cui

Abstract: Electromagnetic information theory (EIT) is an emerging interdisciplinary subject that integrates classical Maxwell electromagnetics and Shannon information theory. The goal of EIT is to uncover the information transmission mechanisms from an electromagnetic (EM) perspective in wireless systems. Existing works on EIT are mainly focused on the analysis of EM channel characteristics, degrees-of-freedom, and system capacity. However, these works do not clarify whether EIT can improve wireless communication systems. To fill this gap, in this paper, we provide a novel example showing that EIT can improve the performance of classical minimum mean squared error (MMSE) channel estimators by replacing the channel covariance matrix with an EM correlation function (EMCF). Specifically, by averaging the solutions of Maxwell's equations over a tunable angular distribution, we obtain a spatio-temporal correlation function (STCF) of the EM channel, which we name the EMCF. Since classical MMSE estimators can exploit prior information contained in the channel covariance matrix, the substitution of EMCF for the covariance matrix introduces EM side information into MMSE estimators. Furthermore, we dynamically tune the EMCF parameters to better fit the channel observations. Simulation results show that the proposed EIT-MMSE channel estimator outperforms traditional MMSE estimators, thus proving that EIT is beneficial to wireless communication systems.

Submitted 6 February, 2024; v1 submitted 18 October, 2023; originally announced October 2023.

Comments: Electromagnetic information theory (EIT) is an emerging interdisciplinary subject, aiming at providing a unified analytical framework for wireless systems as well as guiding practical system design. This paper answers the question: "Can we improve wireless communication systems via EIT?"

arXiv:2310.00687 [pdf, ps, other]

DISCO Might Not Be Funky: Random Intelligent Reflective Surface Configurations That Attack

Authors: Huan Huang, Lipeng Dai, Hongliang Zhang, Chongfu Zhang, Zhongxing Tian, Yi Cai, A. Lee Swindlehurst, Zhu Han

Abstract: Emerging intelligent reflective surfaces (IRSs) significantly improve system performance, but also pose a significant risk for physical layer security (PLS). Unlike the extensive research on legitimate IRS-enhanced communications, in this article we present an adversarial IRS-based fully-passive jammer (FPJ). We describe typical application scenarios for Disco IRS (DIRS)-based FPJ, where an illegitimate IRS with random, time-varying reflection properties acts like a "disco ball" to randomly change the propagation environment. We introduce the principles of DIRS-based FPJ and overview existing investigations of the technology, including a design example employing one-bit phase shifters. The DIRS-based FPJ can be implemented without either jamming power or channel state information (CSI) for the legitimate users (LUs). It does not suffer from the energy constraints of traditional active jammers, nor does it require any knowledge of the LU channels. In addition to the proposed jamming attack, we also propose an anti-jamming strategy that requires only statistical rather than instantaneous CSI. Furthermore, we present a data frame structure that enables the legitimate access point (AP) to estimate the DIRS-jammed channels' statistical characteristics in the presence of the DIRS jamming. Typical cases are discussed to show the impact of the DIRS-based FPJ and the feasibility of the anti-jamming precoder (AJP). Moreover, we outline future research directions and challenges for the DIRS-based FPJ and its anti-jamming precoding to stimulate this line of research and pave the way for practical applications.

Submitted 10 June, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

Comments: This paper has been accepted by IEEE Wireless Communications. The code for the DISCO RIS is available on GitHub (https://github.com/huanhuan1799/Disco-Intelligent-Reflecting-Surfaces-Active-Channel-Aging-for-Fully-Passive-Jamming-Attacks)

arXiv:2309.09242 [pdf, ps, other]

Toward Beamfocusing-Aided Near-Field Communications: Research Advances, Potential, and Challenges

Authors: Jiancheng An, Chau Yuen, Linglong Dai, Marco Di Renzo, Merouane Debbah, Lajos Hanzo

Abstract: Next-generation mobile networks promise to support high throughput, massive connectivity, and improved energy efficiency. To achieve these ambitious goals, extremely large-scale antenna arrays (ELAAs) and terahertz communications constitute a pair of promising technologies. This will result in future wireless communications occurring in the near-field regions. To accurately portray the channel characteristics of near-field wireless propagation, spherical wavefront-based models are required and present both opportunities as well as challenges. Following the basics of near-field communications (NFC), we contrast it to conventional far-field communications. Moreover, we cover the key challenges of NFC, including its channel modeling and estimation, near-field beamfocusing, as well as hardware design. Our numerical results demonstrate the potential of NFC in improving the spatial multiplexing gain and positioning accuracy. Finally, a suite of open issues is identified for motivating future research.

Submitted 17 September, 2023; originally announced September 2023.

Comments: 8 pages, 5 figures, 1 table

arXiv:2308.15716 [pdf, ps, other]

Anti-Jamming Precoding Against Disco Intelligent Reflecting Surfaces Based Fully-Passive Jamming Attacks

Authors: Huan Huang, Lipeng Dai, Hongliang Zhang, Zhongxing Tian, Yi Cai, Chongfu Zhang, A. Lee Swindlehurst, Zhu Han

Abstract: Emerging intelligent reflecting surfaces (IRSs) significantly improve system performance, but also pose a huge risk for physical layer security. Existing works have illustrated that a disco IRS (DIRS), i.e., an illegitimate IRS with random time-varying reflection properties (like a "disco ball"), can be employed by an attacker to actively age the channels of legitimate users (LUs). Such active channel aging (ACA) generated by the DIRS can be employed to jam multi-user multiple-input single-output (MU-MISO) systems without relying on either jamming power or LU channel state information (CSI). To address the significant threats posed by DIRS-based fully-passive jammers (FPJs), an anti-jamming precoder is proposed that requires only the statistical characteristics of the DIRS-based ACA channels instead of their CSI. The statistical characteristics of DIRS-jammed channels are first derived, and then the anti-jamming precoder is derived based on the statistical characteristics. Furthermore, we prove that the anti-jamming precoder can achieve the maximum signal-to-jamming-plus-noise ratio (SJNR). To acquire the ACA statistics without changing the system architecture or cooperating with the illegitimate DIRS, we design a data frame structure that the legitimate access point (AP) can use to estimate the statistical characteristics. During the designed data frame, the LUs only need to feed back their received power to the legitimate AP when they detect jamming attacks. Numerical results are also presented to evaluate the effectiveness of the proposed anti-jamming precoder against the DIRS-based FPJs and the feasibility of the designed data frame used by the legitimate AP to estimate the statistical characteristics.

Submitted 24 January, 2024; v1 submitted 29 August, 2023; originally announced August 2023.

Comments: This paper has been submitted for possible publication

arXiv:2308.14553 [pdf, other]

Rep2wav: Noise Robust text-to-speech Using self-supervised representations

Authors: Qiushi Zhu, Yu Gu, Rilin Chen, Chao Weng, Yuchen Hu, Lirong Dai, Jie Zhang

Abstract: Benefiting from the development of deep learning, text-to-speech (TTS) techniques using clean speech have achieved significant performance improvements. The data collected from real scenes often contains noise and generally needs to be denoised by speech enhancement models. Noise-robust TTS models are often trained using the enhanced speech, which thus suffer from speech distortion and background noise that affect the quality of the synthesized speech. Meanwhile, it was shown that self-supervised pre-trained models exhibit excellent noise robustness on many speech tasks, implying that the learned representation has a better tolerance for noise perturbations. In this work, we therefore explore pre-trained models to improve the noise robustness of TTS models. Based on HiFi-GAN, we first propose a representation-to-waveform vocoder, which aims to learn to map the representation of pre-trained models to the waveform. We then propose a text-to-representation FastSpeech2 model, which aims to learn to map text to pre-trained model representations. Experimental results on the LJSpeech and LibriTTS datasets show that our method outperforms those using speech enhancement methods in both subjective and objective metrics. Audio samples are available at: https://zqs01.github.io/rep2wav.

Submitted 3 September, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

Comments: 5 pages, 2 figures

arXiv:2307.16518 [pdf, other]

Continuous-Time Channel Prediction Based on Tensor Neural Ordinary Differential Equation

Authors: Mingyao Cui, Hao Jiang, Yuhao Chen, Yang Du, Linglong Dai

Abstract: Channel prediction is critical to address the channel aging issue in mobile scenarios. Existing channel prediction techniques are mainly designed for discrete channel prediction, which can only predict the future channel in a fixed time slot per frame, while the other intra-frame channels are usually recovered by interpolation. However, these approaches suffer from a serious interpolation loss, especially for mobile millimeter wave communications. To solve this challenging problem, we propose a tensor neural ordinary differential equation (TN-ODE) based continuous-time channel prediction scheme to realize the direct prediction of intra-frame channels. Specifically, inspired by the recently developed continuous mapping model named neural ODE in the field of machine learning, we first utilize the neural ODE model to predict future continuous-time channels. To improve the channel prediction accuracy and reduce computational complexity, we then propose the TN-ODE scheme to learn the structural characteristics of the high-dimensional channel via a low-dimensional learnable transform. Simulation results show that the proposed scheme is able to achieve higher intra-frame channel prediction accuracy than existing schemes.

Submitted 31 July, 2023; originally announced July 2023.

Comments: A tensor neural ODE based method is proposed to predict continuous-time wireless channels

arXiv:2307.12307 [pdf, other]

Robust Weighted Sum-Rate Maximization for Transmissive RIS Transmitter Enabled RSMA Networks

Authors: Bojiang Li, Wen Chen, Zhendong Li, Qingqing Wu, Nan Cheng, Changle Li, Linglong Dai

Abstract: Due to the low power consumption and low-cost nature of transmissive reconfigurable intelligent surface (RIS), in this paper, we propose a downlink multi-user rate-splitting multiple access (RSMA) architecture based on the transmissive RIS transmitter, where the channel state information (CSI) is only acquired partially. We investigate the weighted sum-rate maximization problem by jointly optimizing the power, RIS transmissive coefficients and common rate allocated to each user. Due to the coupling of optimization variables, the problem is nonconvex, and it is difficult to directly obtain the optimal solution. Hence, a block coordinate descent (BCD) algorithm based on sample average approximation (SAA) and weighted minimum mean square error (WMMSE) is proposed to tackle it. Numerical results illustrate that the transmissive RIS transmitter with rate-splitting architecture has advantages over conventional space division multiple access (SDMA) and non-orthogonal multiple access (NOMA).

Submitted 23 July, 2023; originally announced July 2023.

arXiv:2306.16206 [pdf, other]

Near-Field Beam Management for Extremely Large-Scale Array Communications

Authors: Changsheng You, Yunpu Zhang, Chenyu Wu, Yong Zeng, Beixiong Zheng, Li Chen, Linglong Dai, A. Lee Swindlehurst

Abstract: Extremely large-scale arrays (XL-arrays) have emerged as a promising technology to achieve super-high spectral efficiency and spatial resolution in future wireless systems. The large aperture of XL-arrays means that spherical rather than planar wavefronts must be considered, and a paradigm shift from far-field to near-field communications is necessary. Unlike existing works that have mainly considered far-field beam management, we study the new near-field beam management for XL-arrays. We first provide an overview of near-field communications and introduce various applications of XL-arrays in both outdoor and indoor scenarios. Then, three typical near-field beam management methods for XL-arrays are discussed: near-field beam training, beam tracking, and beam scheduling. We point out their main design issues and propose promising solutions to address them. Moreover, other important directions in near-field communications are also highlighted to motivate future research.

Submitted 28 June, 2023; originally announced June 2023.

Comments: We studied the new near-field beam management for XL-arrays. This paper has been submitted to IEEE for possible publication

arXiv:2306.02759 [pdf, other]

On the Role of ViT and CNN in Semantic Communications: Analysis and Prototype Validation

Authors: Hanju Yoo, Linglong Dai, Songkuk Kim, Chan-Byoung Chae

Abstract: Semantic communications have shown promising advancements by optimizing source and channel coding jointly. However, the dynamics of these systems remain understudied, limiting research and performance gains. Inspired by the robustness of Vision Transformers (ViTs) in handling image nuisances, we propose a ViT-based model for semantic communications. Our approach achieves a peak signal-to-noise ratio (PSNR) gain of +0.5 dB over convolutional neural network variants. We introduce novel measures, average cosine similarity and Fourier analysis, to analyze the inner workings of semantic communications and optimize the system's performance. We also validate our approach through a real wireless channel prototype using software-defined radio (SDR). To the best of our knowledge, this is the first investigation of the fundamental workings of a semantic communications system, accompanied by the pioneering hardware implementation. To facilitate reproducibility and encourage further research, we provide open-source code, including neural network implementations and LabVIEW codes for SDR-based wireless transmission systems.

Submitted 5 June, 2023; originally announced June 2023.
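
For a sense of scale, the +0.5 dB PSNR gain quoted in this entry can be related to mean squared error through the standard definition PSNR = 10 log10(peak^2 / MSE). The sketch below is only a worked calculation under that definition; the function names and the example MSE value are illustrative assumptions, and the figure applies to a single image at a fixed peak value rather than to PSNR averaged over a dataset.

```python
import math

def psnr_db(mse, peak=1.0):
    """PSNR in dB for a given mean squared error and peak signal value."""
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE must shrink to raise PSNR by gain_db."""
    return 10.0 ** (-gain_db / 10.0)

ratio = mse_ratio_for_gain(0.5)       # roughly 0.89
baseline_mse = 0.01                   # hypothetical baseline, PSNR of 20 dB
improved_psnr = psnr_db(baseline_mse * ratio)
```

Under these assumptions, a 0.5 dB gain corresponds to a reduction in MSE of roughly 11 percent.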

arXiv:2305.12459 [pdf, other]

CASA-ASR: Context-Aware Speaker-Attributed ASR

Authors: Mohan Shi, Zhihao Du, Qian Chen, Fan Yu, Yangze Li, Shiliang Zhang, Jie Zhang, Li-Rong Dai

Abstract: Recently, speaker-attributed automatic speech recognition (SA-ASR), which aims at answering the question "who spoke what", has attracted wide attention. Different from modular systems, end-to-end (E2E) SA-ASR minimizes the speaker-dependent recognition errors directly and shows promising applicability. In this paper, we propose a context-aware SA-ASR (CASA-ASR) model by enhancing the contextual modeling ability of E2E SA-ASR. Specifically, in CASA-ASR, a contextual text encoder is involved to aggregate the semantic information of the whole utterance, and a context-dependent scorer is employed to model the speaker discriminability by contrasting with speakers in the context. In addition, a two-pass decoding strategy is further proposed to fully leverage the contextual modeling ability, resulting in better recognition performance. Experimental results on the AliMeeting corpus show that the proposed CASA-ASR model outperforms the original E2E SA-ASR system with a relative improvement of 11.76% in terms of speaker-dependent character error rate.

Submitted 21 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech 2023

arXiv:2305.12450 [pdf, other]

Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction

Authors: Mohan Shi, Yuchun Shu, Lingyun Zuo, Qian Chen, Shiliang Zhang, Jie Zhang, Li-Rong Dai

Abstract: For speech interaction, voice activity detection (VAD) is often used as a front-end. However, traditional VAD algorithms usually need to wait for a continuous tail silence to reach a preset maximum duration before segmentation, resulting in a large latency that affects user experience. In this paper, we propose a novel semantic VAD for low-latency segmentation. Different from existing methods, a frame-level punctuation prediction task is added to the semantic VAD, and the artificial endpoint is included in the classification category in addition to the often-used speech presence and absence. To enhance the semantic information of the model, we also incorporate an automatic speech recognition (ASR) related semantic loss. Evaluations on an internal dataset show that the proposed method can reduce the average latency by 53.3% without significant deterioration of character error rate in the back-end ASR compared to the traditional VAD approach.

Submitted 21 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech2023

arXiv:2305.12111 [pdf, other]

Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection

Authors: Xiao-Min Zeng, Yan Song, Zhu Zhuo, Yu Zhou, Yu-Hong Li, Hui Xue, Li-Rong Dai, Ian McLoughlin

Abstract: In this paper, we propose a joint generative and contrastive representation learning method (GeCo) for anomalous sound detection (ASD). GeCo exploits a Predictive AutoEncoder (PAE) equipped with self-attention as a generative model to perform frame-level prediction. The output of the PAE, together with the original normal samples, is used for supervised contrastive representation learning in a multi-task framework. Besides cross-entropy loss between classes, contrastive loss is used to separate PAE output and original samples within each class. GeCo aims to better capture context information among frames, thanks to the self-attention mechanism of the PAE model. Furthermore, GeCo combines generative and contrastive learning, from which we aim to yield more effective and informative representations compared to existing methods. Extensive experiments have been conducted on the DCASE2020 Task2 development dataset, showing that GeCo outperforms state-of-the-art generative and discriminative methods.

Submitted 20 May, 2023; originally announced May 2023.

Comments: Accepted by ICASSP2023

arXiv:2305.11819 [pdf, other]

doi 10.26599/TST.2023.9010001

Reconfigurable Intelligent Surfaces for 6G: Nine Fundamental Issues and One Critical Problem

Authors: Zijian Zhang, Linglong Dai

Abstract: Thanks to the recent advances in metamaterials, reconfigurable intelligent surface (RIS) has emerged as a promising technology for future 6G wireless communications. Benefiting from their high array gain, low cost, and low power consumption, RISs are expected to greatly enlarge signal coverage, improve system capacity, and increase energy efficiency. In this article, we systematically overview the emerging RIS technology with the focus on its key basics, nine fundamental issues, and one critical problem. Specifically, we first explain the RIS basics, including its working principles, hardware structures, and potential benefits for communications. Based on these basics, nine fundamental issues of RISs, such as "What's the differences between RISs and massive MIMO?" and "Is RIS really intelligent?", are explicitly addressed to elaborate its technical features, distinguish it from existing technologies, and clarify some misunderstandings in the literature. Then, one critical problem of RISs is revealed: due to the "multiplicative fading" effect, existing passive RISs can hardly achieve visible performance gains in many communication scenarios with strong direct links. To address this critical problem, a potential solution called active RISs is introduced, and its effectiveness is demonstrated by numerical simulations.

Submitted 19 May, 2023; originally announced May 2023.

Comments: To appear in TST as an invited paper. This paper discusses nine fundamental issues and one critical problem of RISs. Highly related works can be found at arXiv:2103.15154

arXiv:2305.02875 [pdf, other]

The Manifestation of Spatial Wideband Effect in Circular Array: From Beam Split to Beam Defocus

Authors: Zidong Wu, Linglong Dai

Abstract: Millimeter-wave (mmWave) and terahertz (THz) communications with hybrid precoding architectures have been regarded as energy-efficient solutions to fulfill the vision of high-speed transmissions for 6G communications. Benefiting from the advantages of providing a wide scan range and flat array gain, the uniform circular array (UCA) has attracted much attention. However, the growing bandwidth of mmWave and THz communications requires frequency-dependent phase shifts, which cannot be perfectly realized through frequency-independent phase shifters (PSs) in classical hybrid precoding architectures. This mismatch causes the beam defocus effect in UCA wideband communications, where high-gain beams cannot form at non-central frequencies. In this paper, we first investigate the characteristics of the beam defocus effect that distinguish it from the beam split effect in uniform linear array (ULA) systems. The beam pattern of UCA in both the frequency domain and the angular domain is analyzed, characterizing the beamforming loss caused by the beam defocus effect. Then, the delay-phase-precoding (DPP) architecture, which leverages true-time-delay (TTD) devices to generate frequency-dependent phase shifts, is employed to mitigate the beam defocus effect. Finally, performance analysis and extensive simulation results are provided to evaluate the effectiveness of the DPP architecture in UCA systems.

Submitted 4 May, 2023; originally announced May 2023.

Comments: In this paper, the mechanism of the beam defocus effect in circular array systems is investigated for the first time. The delay-phase precoding architecture is employed to mitigate the beam defocus effect. Simulation codes will be provided to reproduce the results: http://oa.ee.tsinghua.edu.cn/dailinglong/publications/publications.html

arXiv:2305.01980 [pdf, other]

Diverse and Vivid Sound Generation from Text Descriptions

Authors: Guangwei Li, Xuenan Xu, Lingfeng Dai, Mengyue Wu, Kai Yu

Abstract: Previous audio generation mainly focuses on specified sound classes such as speech or music, whose form and content are greatly restricted. In this paper, we go beyond specific audio generation by using natural language description as a clue to generate broad sounds. Unlike visual information, a text description is concise by its nature but has rich hidden meanings beneath, which poses a higher possibility and complexity on the audio to be generated. A Variation-Quantized GAN is used to train a codebook learning discrete representations of spectrograms. For a given text description, its pre-trained embedding is fed to a Transformer to sample codebook indices, which are decoded into a spectrogram that is further transformed into a waveform by a MelGAN vocoder. The generated waveform has high quality and fidelity while corresponding closely to the given text. Experiments show that our proposed method is capable of generating natural, vivid audios, achieving superb quantitative and qualitative results.

Submitted 3 May, 2023; originally announced May 2023.

arXiv:2303.03689 [pdf, other]

AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer

Authors: Kang Li, Yan Song, Li-Rong Dai, Ian McLoughlin, Xin Fang, Lin Liu

Abstract: In this paper, we propose an effective sound event detection (SED) method, termed AST-SED, based on the audio spectrogram transformer (AST) model pretrained on the large-scale AudioSet for the audio tagging (AT) task. Pretrained AST models have recently shown promise on DCASE2022 challenge task4, where they help mitigate a lack of sufficient real annotated data. However, mainly due to differences between the AT and SED tasks, it is suboptimal to directly utilize outputs from a pretrained AST model. Hence the proposed AST-SED adopts an encoder-decoder architecture to enable effective and efficient fine-tuning without needing to redesign or retrain the AST model. Specifically, the Frequency-wise Transformer Encoder (FTE) consists of transformers with self-attention along the frequency axis to address the issue of multiple overlapped audio events in a single clip. The Local Gated Recurrent Units Decoder (LGD) consists of nearest-neighbor interpolation (NNI) and Bidirectional Gated Recurrent Units (Bi-GRU) to compensate for temporal resolution loss in the pretrained AST model output. Experimental results on the DCASE2022 task4 development set have demonstrated the superiority of the proposed AST-SED with FTE-LGD architecture. Specifically, the Event-Based F1-score (EB-F1) of 59.60% and Polyphonic Sound Detection Score scenario1 (PSDS1) score of 0.5140 significantly outperform CRNN and other pretrained AST-based systems.

Submitted 7 March, 2023; originally announced March 2023.

Comments: accepted to ICASSP 2023

arXiv:2301.09082 [pdf, other]

Location Division Multiple Access for Near-Field Communications

Authors: Zidong Wu, Linglong Dai

Abstract: Spatial division multiple access (SDMA) is essential to improve the spectrum efficiency for multi-user multiple-input multiple-output (MIMO) communications. The classical SDMA for massive MIMO with hybrid precoding heavily relies on the angular orthogonality in the far field to distinguish multiple users at different angles, which fails to fully exploit spatial resources in the distance domain. With a dramatically increasing number of antennas, the extremely large-scale antenna array (ELAA) introduces additional resolution in the distance domain in the near field. In this paper, we propose the concept of location division multiple access (LDMA) to provide a new possibility to enhance spectrum efficiency. The key idea is to exploit extra spatial resources in the distance domain to serve different users at different locations (determined by angles and distances) in the near field. Specifically, the asymptotic orthogonality of beam focusing vectors in the distance domain is proved, which reveals that near-field beam focusing is able to focus signals on specific locations to mitigate inter-user interference. Simulation results verify the superiority of the proposed LDMA over classical SDMA in different scenarios.

Submitted 22 January, 2023; originally announced January 2023.

Comments: Accepted by IEEE ICC 2023. This paper investigates the concept of location division multiple access (LDMA) to exploit extra spatial resources in distance domain for multiple access, exploring a new possibility to enhance spectrum efficiency. The journal version is: arXiv:2208.06349. Simulation codes are provided at: http://oa.ee.tsinghua.edu.cn/dailinglong/publications/publications.html

arXiv:2301.03035 [pdf, ps, other]

Cross Far- and Near-field Wireless Communications in Terahertz Ultra-large Antenna Array Systems

Authors: Chong Han, Yuhang Chen, Longfei Yan, Zhi Chen, Linglong Dai

Abstract: The terahertz (THz) band, with its abundant multi-ten-GHz bandwidth, is capable of supporting Terabit-per-second wireless communications, making it a pillar technology for 6G and beyond systems. With sub-millimeter-long antennas, ultra-massive (UM) MIMO and intelligent surface (IS) systems with thousands of array elements are exploited to effectively combat the distance limitation and blockage problems, which compose a promising THz ultra-large antenna array (ULAA) system. As a combined effect of wavelength and array aperture, the resulting coverage of THz systems ranges from near-field to far-field, leading to a new paradigm of cross-field communications. Although channel models, communications theories, and networking strategies have been studied for far-field and near-field separately, the unified design of cross-field communications that achieves high spectral efficiency and low complexity is still missing. In this article, the challenges and features of THz ULAA cross-field communications are investigated. Furthermore, cross-field solutions in three perspectives are presented, including a hybrid spherical- and planar-wave channel model, cross-field channel estimation, and widely-spaced multi-subarray hybrid beamforming, where a subarray as a basic unit in THz ULAA systems is exploited. The approximation error of channel modeling accuracy, spectral efficiency, and estimation error of these designs are numerically evaluated. Finally, as a roadmap of THz ULAA cross-field communications, multiple open problems and potential research directions are elaborated.

Submitted 3 August, 2023; v1 submitted 8 January, 2023; originally announced January 2023.

arXiv:2301.00161 [pdf, other]

doi 10.1109/GLOBECOM48099.2022.10001687

Active RISs: Signal Modeling, Asymptotic Analysis, and Beamforming Design

Authors: Zijian Zhang, Linglong Dai, Xibi Chen, Changhao Liu, Fan Yang, Robert Schober, H. Vincent Poor

Abstract: Reconfigurable intelligent surfaces (RISs) have emerged as a candidate technology for future 6G networks. However, due to the "multiplicative fading" effect, the existing passive RISs only achieve a negligible capacity gain in environments with strong direct links. In this paper, the concept of active RISs is studied to overcome this fundamental limitation. Unlike the existing passive RISs that reflect signals without amplification, active RISs can amplify the reflected signals via amplifiers integrated into their elements. To characterize the signal amplification and incorporate the noise introduced by the active components, we verify the signal model of active RISs through the experimental measurements on a fabricated active RIS element. Based on the verified signal model, we formulate the sum-rate maximization problem for an active RIS aided multi-user multiple-input single-output (MU-MISO) system and a joint transmit precoding and reflect beamforming algorithm is proposed to solve this problem. Simulation results show that, in a typical wireless system, the existing passive RISs can realize only a negligible sum-rate gain of 3%, while the active RISs can achieve a significant sum-rate gain of 62%, thus overcoming the "multiplicative fading" effect. Finally, we develop a 64-element active RIS aided wireless communication prototype, and the significant gain of active RISs is validated by field test.

Submitted 31 December, 2022; originally announced January 2023.

Comments: Accepted by IEEE GLOBECOM 2022. This paper includes a 64-element active RIS aided wireless communication prototype and the field test results. The journal version is at: arXiv:2103.15154. Simulation codes are provided at: http://oa.ee.tsinghua.edu.cn/dailinglong/publications/publications.html

Journal ref: IEEE GLOBECOM 2022

arXiv:2212.14654 [pdf, other]

Enabling More Users to Benefit from Near-Field Communications: From Linear to Circular Array

Authors: Zidong Wu, Mingyao Cui, Linglong Dai

Abstract: Massive multiple-input multiple-output (MIMO) for 5G is evolving into the extremely large-scale antenna array (ELAA) to increase the spectrum efficiency by orders of magnitude for 6G communications. ELAA introduces spherical-wave-based near-field communications, where channel capacity can be significantly improved for single-user and multi-user scenarios. Unfortunately, the near-field region at large incidence/emergence angles is greatly reduced with the widely studied uniform linear array (ULA). Thus, many randomly distributed users may fail to benefit from near-field communications. In this paper, we leverage the rotational symmetry of the uniform circular array (UCA) to provide uniform and enlarged near-field regions at all angles, enabling more users to benefit from near-field communications. Specifically, by exploiting the geometrical relationship between UCA and users, the near-field beamforming technique for UCA is developed. Based on the analysis of near-field beamforming, we reveal that UCA is able to provide a larger near-field region than ULA in terms of the effective Rayleigh distance. Moreover, a concentric-ring codebook is designed to realize efficient codebook-based beamforming in the near-field region. In addition, we find that UCA can generate orthogonal near-field beams along the same direction when the focal point of one near-field beam lies exactly at the zeros of the other beams, which has the potential to further improve spectrum efficiency in multi-user communications compared with ULA. Simulation results are provided to verify the effectiveness of the theoretical analysis and the feasibility of UCA to enable more users to benefit from near-field communications by broadening the near-field region.

Submitted 30 October, 2023; v1 submitted 30 December, 2022; originally announced December 2022.

Comments: Accepted by IEEE TWC. In this paper, the rotational symmetry of UCA is leveraged to provide uniform and enlarged near-field regions, enabling more users to benefit from near-field communications. Simulation codes will be provided to reproduce the results in this paper: http://oa.ee.tsinghua.edu.cn/dailinglong/publications/publications.html

arXiv:2212.08401 [pdf, other]

Near-Field Wideband Channel Estimation for Extremely Large-Scale MIMO

Authors: Mingyao Cui, Linglong Dai

Abstract: Extremely large-scale multiple-input-multiple-output (XL-MIMO) at millimeter-wave (mmWave) and terahertz (THz) bands plays an important role in supporting extremely high beamforming gain as well as ultra-wideband spectrum resources. Unfortunately, accurate wideband XL-MIMO channel estimation suffers from a new challenge called the near-field beam split effect. Prior works either neglect the accurate near-field channel model or fail to exploit the beam split effect, resulting in poor channel estimation accuracy for wideband XL-MIMO. To tackle this problem, this paper proposes a bilinear pattern detection (BPD) based approach to accurately recover the wideband XL-MIMO channel. Specifically, by analyzing the characteristics of near-field wideband channels, we first reveal the bilinear pattern of the near-field beam split effect, which implies that the sparse support set of near-field channels in both the angle and the distance domains can be regarded as a linear function of frequency. Then, inspired by the classical simultaneous orthogonal matching pursuit technique, we use the bilinear pattern to estimate the angle-of-arrival (AoA) and distance parameters of each near-field path component at all frequencies. In this way, the entire wideband XL-MIMO channel can be recovered by compressed sensing algorithms. Moreover, we provide the computational complexity of the proposed algorithm compared with existing algorithms. Finally, simulation results demonstrate that our scheme can achieve accurate estimation of the near-field wideband XL-MIMO channel in the presence of the near-field beam split effect.

Submitted 16 December, 2022; originally announced December 2022.

Comments: This paper has been accepted by Science China Information Sciences. Simulation codes will be provided to reproduce the results in this paper: http://oa.ee.tsinghua.edu.cn/dailinglong/publications/publications.html

arXiv:2211.11275 [pdf, other]

doi 10.1109/TMM.2023.3275873

VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Authors: Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, Lirong Dai, Daxin Jiang, Jinyu Li, Furu Wei

Abstract: Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision, text. How to design a unified framework to integrate different modal information and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning has not been well explored. In this paper, we propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model the modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens, given by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual related downstream tasks, including audio-visual speech recognition (AVSR) and visual speech recognition (VSR) tasks. Results show that the proposed VATLM outperforms previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm.

Submitted 19 May, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: 11 pages, Accepted by IEEE Transactions on Multimedia

arXiv:2211.00511 [pdf, other]

A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings

Authors: Mohan Shi, Jie Zhang, Zhihao Du, Fan Yu, Qian Chen, Shiliang Zhang, Li-Rong Dai

Abstract: Speaker-attributed automatic speech recognition (SA-ASR) in multi-party meeting scenarios is one of the most valuable and challenging ASR tasks. It was shown that single-channel frame-level diarization with serialized output training (SC-FD-SOT), single-channel word-level diarization with SOT (SC-WD-SOT) and joint training of single-channel target-speaker separation and ASR (SC-TS-ASR) can be exploited to partially solve this problem. In this paper, we propose three corresponding multichannel (MC) SA-ASR approaches, namely MC-FD-SOT, MC-WD-SOT and MC-TS-ASR. For different tasks/models, different multichannel data fusion strategies are considered, including channel-level cross-channel attention for MC-FD-SOT, frame-level cross-channel attention for MC-WD-SOT and neural beamforming for MC-TS-ASR. Results on the AliMeeting corpus reveal that our proposed models can consistently outperform the corresponding single-channel counterparts in terms of the speaker-dependent character error rate.

Submitted 1 March, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

arXiv:2210.15324 [pdf, other]

Robust Data2vec: Noise-robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning

Authors: Qiu-Shi Zhu, Long Zhou, Jie Zhang, Shu-Jie Liu, Yu-Chen Hu, Li-Rong Dai

Abstract: Self-supervised pre-training methods based on contrastive learning or regression tasks can utilize more unlabeled data to improve the performance of automatic speech recognition (ASR). However, the robustness impact of combining the two pre-training tasks and constructing different negative samples for contrastive learning still remains unclear. In this paper, we propose a noise-robust data2vec for self-supervised speech representation learning by jointly optimizing the contrastive learning and regression tasks in the pre-training stage. Furthermore, we present two improved methods to facilitate contrastive learning. More specifically, we first propose to construct patch-based non-semantic negative samples to boost the noise robustness of the pre-training model, which is achieved by dividing the features into patches of different sizes (i.e., so-called negative samples). Second, by analyzing the distribution of positive and negative samples, we propose to remove the easily distinguishable negative samples to improve the discriminative capacity of pre-training models. Experimental results on the CHiME-4 dataset show that our method is able to improve the performance of the pre-trained model in noisy scenarios. We find that joint training of the contrastive learning and regression tasks can avoid model collapse to some extent compared to only training the regression task.

Submitted 27 October, 2022; originally announced October 2022.

Comments: Submitted to ICASSP 2023

arXiv:2210.03730 [pdf, other]

SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

Authors: Ziqiang Zhang, Long Zhou, Junyi Ao, Shujie Liu, Lirong Dai, Jinyu Li, Furu Wei

Abstract: The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods. In this paper, we propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder. Leveraging hidden-unit as an interface to align speech and text, we can decompose the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be jointly pre-trained with unpaired speech and text data respectively. Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks. Experimental results show that SpeechUT gets substantial improvements over strong baselines, and achieves state-of-the-art performance on both the LibriSpeech ASR and MuST-C ST tasks. To better understand the proposed SpeechUT, detailed analyses are conducted. The code and pre-trained models are available at https://aka.ms/SpeechUT.

Submitted 7 October, 2022; originally announced October 2022.

Comments: 14 pages, accepted by EMNLP 2022

arXiv:2209.15329 [pdf, other]

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

Authors: Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, Furu Wei

Abstract: How to boost speech pre-training with textual data is an unsolved problem due to the fact that speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, including phoneme-unit and hidden-unit tokenizers, which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM.

Submitted 15 June, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

Comments: We have corrected the errors in the pre-training data for the SpeechLM-P Base models; the results have been updated accordingly

arXiv:2209.07884

Workflow-based Fast Data-driven Predictive Control with Disturbance Observer in Cloud-edge Collaborative Architecture

Authors: Runze Gao, Qiwen Li, Li Dai, Yufeng Zhan, Yuanqing Xia

Abstract: Data-driven predictive control (DPC) has been studied and used in various scenarios, since it can generate the predicted control sequence relying only on historical input and output data. Recently, based on cloud computing, the data-driven predictive cloud control system (DPCCS) has been proposed, with the advantage of sufficient computational resources. However, the existing computation mode of DPCCS is centralized. This computation mode cannot fully utilize the computing power of cloud computing, whose structure is distributed. Thus, the computation delay cannot be reduced and still affects the control quality. In this paper, a novel cloud-edge collaborative containerised workflow-based DPC system with disturbance observer (DOB) is proposed to improve the computation efficiency and guarantee the control accuracy. First, a construction method for the DPC workflow is designed to match the distributed processing environment of cloud computing. However, the non-computation overheads of the workflow tasks are relatively high. Therefore, a cloud-edge collaborative control scheme with DOB is designed. The low-weight data can be truncated to reduce the non-computation overheads. Meanwhile, we design an edge DOB to estimate and compensate for the uncertainty in cloud workflow processing, and obtain the composite control variable. The UUB stability of the DOB is also proved. Third, to execute the workflow-based DPC controller and evaluate the proposed cloud-edge collaborative control scheme with DOB in a real cloud environment, we design and implement a practical workflow-based cloud control experimental system based on container technology. Finally, a series of evaluations show that the computation times are decreased by 45.19% and 74.35% for two real-time control examples, respectively, and by at most 85.10% for a high-dimension control example.

Submitted 16 September, 2022; originally announced September 2022.

Comments: 58 pages and 23 figures

arXiv:2209.01424

Dynamic Write-Voltage Design and Read-Voltage Optimization for MLC NAND Flash Memory

Authors: Runbin Cai, Yi Fang, Zhifang Shi, Lin Dai, Guojun Han

Abstract: To mitigate the impact of noise and interference on multi-level-cell (MLC) flash memory with the use of low-density parity-check (LDPC) codes, we propose a dynamic write-voltage design scheme considering the asymmetric property of raw bit error rate (RBER), which can obtain the optimal write voltage by minimizing a cost function. In order to further improve the decoding performance of flash memory, we put forward a low-complexity entropy-based read-voltage optimization scheme, which derives the read voltages by searching for the optimal entropy value via a log-likelihood ratio (LLR)-aware cost function. Simulation results demonstrate the superiority of our proposed dynamic write-voltage design scheme and read-voltage optimization scheme with respect to the existing counterparts.

Submitted 3 September, 2022; originally announced September 2022.

Comments: 12 pages, 6 figures, submitted to China Communications

arXiv:2208.06349

Multiple access for near-field communications: SDMA or LDMA?

Authors: Zidong Wu, Linglong Dai

Abstract: Spatial division multiple access (SDMA) is essential to improve the spectrum efficiency for multi-user multiple-input multiple-output (MIMO) communications. The classical SDMA for massive MIMO with hybrid precoding heavily relies on the angular orthogonality in the far field to distinguish multiple users at different angles, which fails to fully exploit spatial resources in the distance domain. With the dramatically increasing number of antennas, the extremely large-scale antenna array (ELAA) introduces additional resolution in the distance domain in the near field. In this paper, we propose the concept of location division multiple access (LDMA) to provide a new possibility to enhance spectrum efficiency compared with classical SDMA. The key idea is to exploit extra spatial resources in the distance domain to serve different users at different locations (determined by angles and distances) in the near field. Specifically, the asymptotic orthogonality of near-field beam focusing vectors in the distance domain is proved, which reveals that near-field beam focusing is able to focus signals on specific locations with limited leakage energy at other locations. This special property could be leveraged in hybrid precoding to mitigate inter-user interferences for spectrum efficiency enhancement. Moreover, we provide the spherical-domain codebook design method for LDMA communications with the uniform planar array, which provides the sampling method in the distance domain. Additionally, performance analysis of LDMA is provided to reveal that the asymptotic optimal spectrum efficiency could be achieved with the increasing number of antennas. Finally, simulation results verify the superiority of the proposed LDMA over SDMA in different scenarios.

Submitted 26 June, 2023; v1 submitted 12 August, 2022; originally announced August 2022.

Comments: Accepted by IEEE JSAC. This paper investigates the concept of location division multiple access (LDMA) to exploit extra spatial resources in distance domain for multiple access, exploring a new possibility to enhance spectrum efficiency. Simulation codes will be provided at: http://oa.ee.tsinghua.edu.cn/dailinglong/publications/publications.html

arXiv:2208.04509

Reconfigurable Intelligent Computational Surfaces: When Wave Propagation Control Meets Computing

Authors: Bo Yang, Xuelin Cao, Jindan Xu, Chongwen Huang, George C. Alexandropoulos, Linglong Dai, Mérouane Debbah, H. Vincent Poor, Chau Yuen

Abstract: The envisioned sixth-generation (6G) of wireless networks will involve an intelligent integration of communications and computing, thereby meeting the urgent demands of diverse applications. To realize the concept of the smart radio environment, reconfigurable intelligent surfaces (RISs) are a promising technology for offering programmable propagation of impinging electromagnetic signals via external control. However, the purely reflective nature of conventional RISs induces significant challenges in supporting computation-based applications, e.g., wave-based calculation and signal processing. To fulfil future communication and computing requirements, new materials are needed to complement the existing technologies of metasurfaces, enabling further diversification of electronics and their applications. In this event, we introduce the concept of reconfigurable intelligent computational surface (RICS), which is composed of two reconfigurable multifunctional layers: the 'reconfigurable beamforming layer' which is responsible for tunable signal reflection, absorption, and refraction, and the 'intelligence computation layer' that concentrates on metamaterials-based computing. By exploring the recent trends on computational metamaterials, RICSs have the potential to make joint communication and computation a reality. We further demonstrate two typical applications of RICSs for performing wireless spectrum sensing and secrecy signal processing. Future research challenges arising from the design and operation of RICSs are finally highlighted.

Submitted 3 October, 2022; v1 submitted 8 August, 2022; originally announced August 2022.

Showing 1–50 of 127 results for author: Dai, L