-
Data on the Move: Traffic-Oriented Data Trading Platform Powered by AI Agent with Common Sense
Authors:
Yi Yu,
Shengyue Yao,
Tianchen Zhou,
Yexuan Fu,
Jingru Yu,
Ding Wang,
Xuhong Wang,
Cen Chen,
Yilun Lin
Abstract:
In the digital era, data has become a pivotal asset, advancing technologies such as autonomous driving. Despite this, data trading faces challenges like the absence of robust pricing methods and the lack of trustworthy trading mechanisms. To address these challenges, we introduce a traffic-oriented data trading platform named Data on the Move (DTM), integrating traffic simulation, data trading, and Artificial Intelligence (AI) agents. The DTM platform supports evidence-based data value evaluation and AI-based trading mechanisms. Leveraging the common sense capabilities of Large Language Models (LLMs) to assess traffic state and data value, DTM can determine reasonable traffic data pricing through multi-round interaction and simulations. Moreover, DTM provides pricing method validation by simulating traffic systems, multi-agent interactions, and the heterogeneity and irrational behaviors of individuals in the trading market. Within the DTM platform, entities such as connected vehicles and traffic light controllers can engage in information collection, data pricing, trading, and decision-making. Simulation results demonstrate that our proposed AI agent-based pricing approach enhances data trading by offering rational prices, as evidenced by the observed improvement in traffic efficiency. This underscores the effectiveness and practical value of DTM, offering new perspectives for the evolution of data markets and smart cities. To the best of our knowledge, this is the first study employing LLMs in data pricing and a pioneering data trading practice in the field of intelligent vehicles and smart cities.
Submitted 1 July, 2024;
originally announced July 2024.
-
Joint Beamforming and Antenna Position Optimization for Movable Antenna-Assisted Spectrum Sharing
Authors:
Xin Wei,
Weidong Mei,
Dong Wang,
Boyu Ning,
Zhi Chen
Abstract:
Fluid antennas (FAs) and movable antennas (MAs) have drawn increasing attention in wireless communications recently due to their ability to create favorable channel conditions via local antenna movement within a confined region. In this letter, we advance their application for cognitive radio to facilitate efficient spectrum sharing between primary and secondary communication systems. In particular, we aim to jointly optimize the transmit beamforming and MA positions at a secondary transmitter (ST) to maximize the received signal power at a secondary receiver (SR) subject to constraints on the co-channel interference power it imposes on multiple primary receivers (PRs). However, such an optimization problem is difficult to solve optimally because the received signal power at the SR and the interference power at all PRs are highly nonlinear functions of the MA positions. To derive useful insights, we first perform theoretical analyses to unveil MAs' capability to achieve maximum-ratio transmission with the SR and effective interference mitigation for all PRs at the same time. To solve the MA position optimization problem, we propose an alternating optimization (AO) algorithm to obtain a high-quality suboptimal solution. Numerical results demonstrate that our proposed algorithms can significantly outperform the conventional fixed-position antennas (FPAs) and other baseline schemes.
Submitted 27 June, 2024;
originally announced June 2024.
-
AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling
Authors:
Vahid Ahmadi Kalkhorani,
Cheng Yu,
Anurag Kumar,
Ke Tan,
Buye Xu,
DeLiang Wang
Abstract:
Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral mapping for speech separation by leveraging global attention and positional encoding. To effectively utilize visual cues, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before being fed to AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, and COG-MHEAR challenge. Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets.
Submitted 17 June, 2024;
originally announced June 2024.
-
HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model
Authors:
Di Wang,
Meiqi Hu,
Yao Jin,
Yuchun Miao,
Jiaqi Yang,
Yichu Xu,
Xiaolei Qin,
Jiaqi Ma,
Lingyu Sun,
Chenxing Li,
Chuan Fu,
Hongruixuan Chen,
Chengxi Han,
Naoto Yokoya,
Jing Zhang,
Minqiang Xu,
Lin Liu,
Lefei Zhang,
Chen Wu,
Bo Du,
Dacheng Tao,
Liangpei Zhang
Abstract:
Foundation models (FMs) are revolutionizing the analysis and understanding of remote sensing (RS) scenes, including aerial RGB, multispectral, and SAR images. However, hyperspectral images (HSIs), which are rich in spectral information, have not seen much application of FMs, with existing methods often restricted to specific tasks and lacking generality. To fill this gap, we introduce HyperSIGMA, a vision transformer-based foundation model for HSI interpretation, scalable to over a billion parameters. To tackle the spectral and spatial redundancy challenges in HSIs, we introduce a novel sparse sampling attention (SSA) mechanism, which effectively promotes the learning of diverse contextual features and serves as the basic block of HyperSIGMA. HyperSIGMA integrates spatial and spectral features using a specially designed spectral enhancement module. In addition, we construct a large-scale hyperspectral dataset, HyperGlobal-450K, for pre-training, which contains about 450K hyperspectral images, significantly surpassing existing datasets in scale. Extensive experiments on various high-level and low-level HSI tasks demonstrate HyperSIGMA's versatility and superior representational capability compared to current state-of-the-art methods. Moreover, HyperSIGMA shows significant advantages in scalability, robustness, cross-modal transferring capability, and real-world applicability.
Submitted 17 June, 2024;
originally announced June 2024.
-
CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
Authors:
Xueyuan Chen,
Dongchao Yang,
Dingdong Wang,
Xixin Wu,
Zhiyong Wu,
Helen Meng
Abstract:
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. In this paper, we propose a multi-modal DSR model by leveraging neural codec language modeling to improve the reconstruction results, especially for the speaker similarity and prosody naturalness. Our proposed model consists of: (i) a multi-modal content encoder to extract robust phoneme embeddings from dysarthric speech with auxiliary visual inputs; (ii) a speaker codec encoder to extract and normalize the speaker-aware codecs from the dysarthric speech, in order to provide original timbre and normal prosody; (iii) a codec language model based speech decoder to reconstruct the speech based on the extracted phoneme embeddings and normalized codecs. Evaluations on the commonly used UASpeech corpus show that our proposed model can achieve significant improvements in terms of speaker similarity and prosody naturalness.
Submitted 24 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Multi-Static ISAC based on Network-Assisted Full-Duplex Cell-Free Networks: Performance Analysis and Duplex Mode Optimization
Authors:
Fan Zeng,
Ruoyun Liu,
Xiaoyu Sun,
Jingxuan Yu,
Jiamin Li,
Pengchen Zhu,
Dongming Wang,
Xiaohu You
Abstract:
Multi-static integrated sensing and communication (ISAC) technology, which can achieve a wider coverage range and avoid self-interference, is an important trend for the future development of ISAC. Existing multi-static ISAC designs are unable to support the asymmetric uplink (UL)/downlink (DL) communication requirements in the scenario while simultaneously achieving optimal sensing performance. This paper proposes a design for multi-static ISAC based on network-assisted full-duplex (NAFD) cell-free networks, which can solve the above problems. Under this design, closed-form expressions for the individual communication rate and localization error rate are derived under imperfect channel state information, which are respectively utilized to assess the communication and sensing performances. Then, we propose a deep Q-network-based access point (AP) duplex mode optimization algorithm to obtain the trade-off between communication and sensing from the UL and DL perspectives of the APs. Simulation results demonstrate that the NAFD-based ISAC system proposed in this paper can achieve significantly better communication performance than other ISAC systems while ensuring minimal impact on sensing performance. Then, we validate the accuracy of the derived closed-form expressions. Furthermore, the proposed optimization algorithm achieves performance comparable to that of exhaustive search with low complexity.
Submitted 12 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Zero-Shot Fake Video Detection by Audio-Visual Consistency
Authors:
Xiaolou Li,
Zehua Liu,
Chen Chen,
Lantian Li,
Li Guo,
Dong Wang
Abstract:
Recent studies have advocated the detection of fake videos as a one-class detection task, predicated on the hypothesis that the consistency between audio and visual modalities of genuine data is more significant than that of fake data. This methodology, which solely relies on genuine audio-visual data while negating the need for forged counterparts, is thus delineated as a 'zero-shot' detection paradigm. This paper introduces a novel zero-shot detection approach anchored in content consistency across audio and video. By employing pre-trained ASR and VSR models, we recognize the audio and video content sequences, respectively. Then, the edit distance between the two sequences is computed to assess whether the claimed video is genuine. Experimental results indicate that, compared to two mainstream approaches based on semantic consistency and temporal consistency, our approach achieves superior generalizability across various deepfake techniques and demonstrates strong robustness against audio-visual perturbations. Finally, state-of-the-art performance gains can be achieved by simply integrating the decision scores of these three systems.
Submitted 12 June, 2024;
originally announced June 2024.
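The content-consistency check described in the abstract above (compare the ASR and VSR outputs by edit distance) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token sequences stand in for the outputs of pre-trained ASR and VSR models, and the length normalization, the function names, and the `threshold` value are all assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert/delete/substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def inconsistency_score(audio_tokens, video_tokens):
    """Edit distance normalized by the longer sequence (normalization is an assumption)."""
    longest = max(len(audio_tokens), len(video_tokens))
    if longest == 0:
        return 0.0
    return edit_distance(audio_tokens, video_tokens) / longest


def looks_genuine(audio_tokens, video_tokens, threshold=0.5):
    """Hypothetical decision rule: a low audio/video mismatch suggests a genuine video."""
    return inconsistency_score(audio_tokens, video_tokens) <= threshold


# Hypothetical recognizer outputs for one clip.
asr_out = "the cat sat on the mat".split()
vsr_out = "the cat sat on a mat".split()
print(inconsistency_score(asr_out, vsr_out), looks_genuine(asr_out, vsr_out))
```

In practice the threshold would be tuned on genuine data only, which is what makes the setting one-class.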
-
SE/BN Adapter: Parametric Efficient Domain Adaptation for Speaker Recognition
Authors:
Tianhao Wang,
Lantian Li,
Dong Wang
Abstract:
Deploying a well-optimized pre-trained speaker recognition model in a new domain often leads to a significant decline in performance. While fine-tuning is a commonly employed solution, it demands ample adaptation data and suffers from parameter inefficiency, rendering it impractical for real-world applications with limited data available for model adaptation. Drawing inspiration from the success of adapters in self-supervised pre-trained models, this paper introduces an SE/BN adapter to address this challenge. By freezing the core speaker encoder and adjusting the feature maps' weights and activation distributions, we introduce a novel adapter utilizing trainable squeeze-and-excitation (SE) blocks and batch normalization (BN) layers, termed the SE/BN adapter. Our experiments, conducted using VoxCeleb for pre-training and 4 genres from CN-Celeb for adaptation, demonstrate that the SE/BN adapter offers significant performance improvement over the baseline and competes with the vanilla fine-tuning approach by tuning just 1% of the parameters.
Submitted 11 June, 2024;
originally announced June 2024.
-
A Comprehensive Investigation on Speaker Augmentation for Speaker Recognition
Authors:
Zhenyu Zhou,
Shibiao Xu,
Shi Yin,
Lantian Li,
Dong Wang
Abstract:
Data augmentation (DA) has played a pivotal role in the success of deep speaker recognition. Current DA techniques primarily focus on speaker-preserving augmentation, which does not change the speaker trait of the speech and does not create new speakers. Recent research has shed light on the potential of speaker augmentation, which generates new speakers to enrich the training dataset. In this study, we delve into two speaker augmentation approaches: speed perturbation (SP) and vocal tract length perturbation (VTLP). Despite the empirical utilization of both methods, a comprehensive investigation into their efficacy is lacking. Our study, conducted using two public datasets, VoxCeleb and CN-Celeb, revealed that both SP and VTLP are proficient at generating new speakers, leading to significant performance improvements in speaker recognition. Furthermore, they exhibit distinct properties in sensitivity to perturbation factors and data complexity, hinting at the potential benefits of their fusion. Our research underscores the substantial potential of speaker augmentation, highlighting the importance of in-depth exploration and analysis.
Submitted 11 June, 2024;
originally announced June 2024.
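To make the idea of speaker augmentation by speed perturbation concrete, the sketch below resamples a waveform by a speed factor and assigns the result a new speaker label. It is only an illustration under stated assumptions: the linear-interpolation resampler, the function names, the factors 0.9 and 1.1, and the label format are hypothetical choices, not details taken from the paper.

```python
def speed_perturb(samples, factor):
    """Resample so that playback at the original rate is `factor` times faster.

    Changing speed this way shifts both duration and pitch, which is why the
    perturbed audio can be treated as coming from a new speaker.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                          # fractional index into the input
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append((1 - frac) * samples[left] + frac * samples[right])
    return out


def augment_speakers(utterances, factors=(0.9, 1.1)):
    """Yield (waveform, speaker_label) pairs, adding one new label per factor."""
    for samples, speaker in utterances:
        yield samples, speaker                    # original speaker is kept
        for factor in factors:
            yield speed_perturb(samples, factor), f"{speaker}-sp{factor}"


# Toy data: one 100-sample "utterance" from a hypothetical speaker id.
toy = [([float(n) for n in range(100)], "spk001")]
for wav, label in augment_speakers(toy):
    print(label, len(wav))
```

A production pipeline would use a proper band-limited resampler; linear interpolation is used here only to keep the example dependency-free.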
-
SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
Authors:
Dongchao Yang,
Dingdong Wang,
Haohan Guo,
Xueyuan Chen,
Xixin Wu,
Helen Meng
Abstract:
In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simplicity shows in three aspects: (1) It can be trained on a speech-only dataset, without any alignment information; (2) It directly takes plain text as input and generates speech in an NAR way; (3) It tries to model speech in a finite and compact latent space, which alleviates the modeling difficulty of diffusion. More specifically, we propose a novel speech codec model (SQ-Codec) with scalar quantization. SQ-Codec effectively maps the complex speech signal into a finite and compact latent space, named scalar latent space. Benefiting from SQ-Codec, we apply a novel transformer diffusion model in the scalar latent space of SQ-Codec. We train SimpleSpeech on 4k hours of a speech-only dataset; it shows natural prosody and voice cloning ability. Compared with previous large-scale TTS models, it presents significant speech quality and generation speed improvement. Demos are released.
Submitted 14 June, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
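The abstract above describes scalar quantization as mapping speech latents into a finite and compact space. One common way to do this, shown here purely as a sketch, is to squash each latent value into a bounded range and round it to a fixed grid. The `tanh` bound, the `levels_per_side` parameter, and the function name are assumptions for illustration and are not confirmed details of SQ-Codec.

```python
import math


def scalar_quantize(latent, levels_per_side=9):
    """Quantize each value independently to one of 2 * levels_per_side + 1 levels in [-1, 1]."""
    quantized = []
    for value in latent:
        bounded = math.tanh(value)                # compact: squash into (-1, 1)
        quantized.append(round(bounded * levels_per_side) / levels_per_side)  # finite grid
    return quantized


# A hypothetical 5-dimensional latent frame.
frame = [-3.2, -0.4, 0.0, 0.7, 5.0]
print(scalar_quantize(frame))
```

Because every dimension takes one of a small number of values, a downstream generative model only has to cover a bounded, discrete-like space rather than an unbounded continuous one.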
-
Performance Trade-off of Integrated Sensing and Communications for Multi-User Backscatter Systems
Authors:
Yuanming Tian,
Dan Wang,
Chuan Huang,
Wei Zhang
Abstract:
This paper studies the performance trade-off in a multi-user backscatter communication (BackCom) system for integrated sensing and communications (ISAC), where the multi-antenna ISAC transmitter sends excitation signals to power multiple single-antenna passive backscatter devices (BDs), and the multi-antenna ISAC receiver performs joint sensing (localization) and communication tasks based on the backscattered signals from all BDs. Specifically, the localization performance is measured by the Cramér-Rao bound (CRB) on the transmission delay and direction of arrival (DoA) of the backscattered signals, whose closed-form expression is obtained by deriving the corresponding Fisher information matrix (FIM), and the communication performance is characterized by the sum transmission rate of all BDs. Then, to characterize the trade-off between the localization and communication performances, the CRB minimization problem with the communication rate constraint is formulated and shown to be non-convex in general. By exploiting the hidden convexity, we propose an approach that combines fractional programming (FP) and Schur complement techniques to transform the original problem into an equivalent convex form. Finally, numerical results reveal the trade-off between the CRB and sum transmission rate achieved by our proposed method.
Submitted 3 June, 2024;
originally announced June 2024.
-
Exploring Channel Estimation and Signal Detection for ODDM-based ISAC Systems
Authors:
Dezhi Wang,
Chongwen Huang,
Lei Liu,
Xiaoming Chen,
Wei Wang,
Zhaoyang Zhang,
Chau Yuen,
Mérouane Debbah
Abstract:
Motivated by the need for reliable communications in high-mobility scenarios, in this letter we investigate channel estimation and signal detection in integrated sensing and communication (ISAC) systems based on orthogonal delay-Doppler multiplexing (ODDM) modulation, which consists of a pulse train that can achieve orthogonality with respect to the resolution of the delay-Doppler (DD) plane. To enhance the communication performance in ODDM-based ISAC systems, we first propose a low-complexity approximation algorithm for channel estimation, which addresses the challenge of the high complexity from high resolution in the ODDM modulation, and achieves performance close to that of the maximum likelihood estimator scheme. Then, we employ the orthogonal approximate message-passing scheme to detect the symbols in the communication process based on the estimated channel information. Finally, simulation results show that the detection performance of ODDM is better than that of other multi-carrier modulation schemes. Specifically, ODDM outperforms the orthogonal time frequency space scheme by 2.3 dB when the bit error ratio is $10^{-6}$.
Submitted 1 June, 2024;
originally announced June 2024.
-
Synchronization Scheme based on Pilot Sharing in Cell-Free Massive MIMO Systems
Authors:
Qihao Peng,
Hong Ren,
Zhendong Peng,
Cunhua Pan,
Maged Elkashlan,
Dongming Wang,
Jiangzhou Wang,
Xiaohu You
Abstract:
This paper analyzes the impact of a pilot-sharing scheme on synchronization performance in a scenario where several slave access points (APs) with uncertain carrier frequency offsets (CFOs) and timing offsets (TOs) share a common pilot sequence. First, the Cramér-Rao bound (CRB) with pilot contamination is derived for pilot-pairing estimation. Furthermore, a maximum likelihood algorithm is presented to estimate the CFO and TO among the pairing APs. Then, to minimize the sum of CRBs, we devise a synchronization strategy based on a pilot-sharing scheme by jointly optimizing the cluster classification, synchronization overhead, and pilot-sharing scheme, while simultaneously considering the overhead and each AP's synchronization requirements. To solve this NP-hard problem, we simplify it into two sub-problems, namely the cluster classification problem and the pilot-sharing problem. To strike a balance between synchronization performance and overhead, we first classify the clusters by using the K-means algorithm, and propose a criterion to find a good set of master APs. Then, the pilot-sharing scheme is obtained by using the swap-matching operations. Simulation results validate the accuracy of our derivations and demonstrate the effectiveness of the proposed scheme over the benchmark schemes.
Submitted 30 May, 2024; v1 submitted 29 May, 2024;
originally announced May 2024.
-
QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge
Authors:
Hongwei Bran Li,
Fernando Navarro,
Ivan Ezhov,
Amirhossein Bayat,
Dhritiman Das,
Florian Kofler,
Suprosanna Shit,
Diana Waldmannstetter,
Johannes C. Paetzold,
Xiaobin Hu,
Benedikt Wiestler,
Lucas Zimmer,
Tamaz Amiranashvili,
Chinmay Prabhakar,
Christoph Berger,
Jonas Weidner,
Michelle Alonso-Basant,
Arif Rashid,
Ujjwal Baid,
Wesam Adel,
Deniz Ali,
Bhakti Baheti,
Yingbin Bai,
Ishaan Bhatt,
Sabri Can Cetindag
, et al. (55 additional authors not shown)
Abstract:
Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the development and evaluation of automated segmentation algorithms. Accurately modeling and quantifying this variability is essential for enhancing the robustness and clinical applicability of these algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ), which was organized in conjunction with International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020 and 2021. The challenge focuses on the uncertainty quantification of medical image segmentation, which considers the omnipresence of inter-rater variability in imaging datasets. The large collection of images with multi-rater annotations features various modalities such as MRI and CT; various organs such as the brain, prostate, kidney, and pancreas; and different image dimensions (2D vs. 3D). A total of 24 teams submitted different solutions to the problem, combining various baseline models, Bayesian neural networks, and ensemble model techniques. The obtained results indicate the importance of the ensemble models, as well as the need for further research to develop efficient uncertainty quantification methods for 3D segmentation tasks.
Submitted 24 June, 2024; v1 submitted 19 March, 2024;
originally announced May 2024.
-
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
Authors:
Chenyang Le,
Yao Qian,
Dongmei Wang,
Long Zhou,
Shujie Liu,
Xiaofei Wang,
Midia Yousefi,
Yanmin Qian,
Jinyu Li,
Sheng Zhao,
Michael Zeng
Abstract:
There is rising interest in research on directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., a pipeline framework that concatenates speech recognition, machine translation and text-to-speech models. The primary challenges stem from the inherent complexities involved in direct translation tasks and the scarcity of data. In this study, we introduce a novel model framework, TransVIP, that leverages diverse datasets in a cascade fashion yet facilitates end-to-end inference through joint probability. Furthermore, we propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process, making it highly suitable for scenarios such as video dubbing. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
Submitted 28 May, 2024;
originally announced May 2024.
-
When Large Language Models Meet Optical Networks: Paving the Way for Automation
Authors:
Danshi Wang,
Yidi Wang,
Xiaotian Jiang,
Yao Zhang,
Yue Pang,
Min Zhang
Abstract:
Since the advent of GPT, large language models (LLMs) have brought about revolutionary advancements in all walks of life. As a superior natural language processing (NLP) technology, LLMs have consistently achieved state-of-the-art performance in numerous areas. However, LLMs are considered to be general-purpose models for NLP tasks, which may encounter challenges when applied to complex tasks in specialized fields such as optical networks. In this study, we propose a framework of LLM-empowered optical networks, facilitating intelligent control of the physical layer and efficient interaction with the application layer through an LLM-driven agent (AI-Agent) deployed in the control layer. The AI-Agent can leverage external tools and extract domain knowledge from a comprehensive resource library specifically established for optical networks. This is achieved through user input and well-crafted prompts, enabling the generation of control instructions and result representations for autonomous operation and maintenance in optical networks. To improve the LLM's capability in professional fields and stimulate its potential for complex tasks, the details of performing prompt engineering, establishing a domain knowledge library, and implementing complex tasks are illustrated in this study. Moreover, the proposed framework is verified on two typical tasks: network alarm analysis and network performance optimization. The good response accuracies and semantic similarities of 2,400 test situations exhibit the great potential of LLMs in optical networks.
Submitted 24 June, 2024; v1 submitted 14 May, 2024;
originally announced May 2024.
-
A Real-Time Voice Activity Detection Based On Lightweight Neural
Authors:
Jidong Jia,
Pei Zhao,
Di Wang
Abstract:
Voice activity detection (VAD) is the task of detecting speech in an audio stream, which is challenging due to numerous unseen noises and low signal-to-noise ratios in real environments. Recently, neural network-based VADs have alleviated the degradation of performance to some extent. However, the majority of existing studies have employed excessively large models and incorporated future context, while neglecting to evaluate the operational efficiency and latency of the models. In this paper, we propose a lightweight and real-time neural network called MagicNet, which utilizes causal and depth-separable 1-D convolutions and a GRU. Without relying on future features as input, our proposed model is compared with two state-of-the-art algorithms on synthesized in-domain and out-of-domain test datasets. The evaluation results demonstrate that MagicNet can achieve improved performance and robustness with fewer parameters.
Submitted 26 May, 2024;
originally announced May 2024.
-
Physics-informed Score-based Diffusion Model for Limited-angle Reconstruction of Cardiac Computed Tomography
Authors:
Shuo Han,
Yongshun Xu,
Dayang Wang,
Bahareh Morovati,
Li Zhou,
Jonathan S. Maltz,
Ge Wang,
Hengyong Yu
Abstract:
Cardiac computed tomography (CT) has emerged as a major imaging modality for the diagnosis and monitoring of cardiovascular diseases. High temporal resolution is essential to ensure diagnostic accuracy. Limited-angle data acquisition can reduce scan time and improve temporal resolution, but it typically leads to severe image degradation, which motivates improved reconstruction techniques. In this paper, we propose a novel physics-informed score-based diffusion model (PSDM) for limited-angle reconstruction of cardiac CT. At sampling time, we combine a data prior from a diffusion model with a model prior obtained via an iterative algorithm and Fourier fusion to further enhance the image quality. Specifically, our approach integrates the primal-dual hybrid gradient (PDHG) algorithm with score-based diffusion models, thereby enabling us to reconstruct high-quality cardiac CT images from limited-angle data. Numerical simulations and real-data experiments confirm the effectiveness of our proposed approach.
Submitted 23 May, 2024;
originally announced May 2024.
-
Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer
Authors:
Weifei Jin,
Yuxin Cao,
Junjie Su,
Qi Shen,
Kai Ye,
Derui Wang,
Jie Hao,
Ziyao Liu
Abstract:
In light of the widespread application of Automatic Speech Recognition (ASR) systems, their security concerns have received much more attention than ever before, primarily due to the susceptibility of Deep Neural Networks. Previous studies have illustrated that surreptitiously crafting adversarial perturbations enables the manipulation of speech recognition systems, resulting in the production of malicious commands. These attack methods mostly require adding noise perturbations under $\ell_p$ norm constraints, inevitably leaving behind artifacts of manual modifications. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples based on Text-to-Speech (TTS) synthesis audio. However, style modifications based on optimization objectives significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the effect of the Style Transfer Attack (STA), which combines style transfer and adversarial attack in sequential order. Then, as an improvement, we propose an iterative Style Code Attack (SCA) to maintain audio quality. Experimental results show that our method can meet the need for user-customized styles and achieve a success rate of 82% in attacks, while keeping sound naturalness according to our user study.
Submitted 15 May, 2024;
originally announced May 2024.
-
AFDM Channel Estimation in Multi-Scale Multi-Lag Channels
Authors:
Rongyou Cao,
Yuheng Zhong,
Jiangbin Lyu,
Deqing Wang,
Liqun Fu
Abstract:
Affine Frequency Division Multiplexing (AFDM) is a brand new chirp-based multi-carrier (MC) waveform for high mobility communications, with promising advantages over Orthogonal Frequency Division Multiplexing (OFDM) and other MC waveforms. Existing AFDM research focuses on wireless communication at high carrier frequency (CF), which typically considers only Doppler frequency shift (DFS) as a result of mobility, while ignoring the accompanying Doppler time scaling (DTS) of the waveform. However, for underwater acoustic (UWA) communication, which operates at a much lower CF and propagates at the speed of sound, the DTS effect cannot be ignored and poses significant challenges for channel estimation. This paper analyzes the channel frequency response (CFR) of AFDM under multi-scale multi-lag (MSML) channels, where each propagation path can have a different delay and DFS/DTS. Based on the newly derived input-output formula and its characteristics, two new channel estimation methods are proposed, i.e., AFDM with iterative multi-index (AFDM-IMI) estimation under low to moderate DTS, and AFDM with orthogonal matching pursuit (AFDM-OMP) estimation under high DTS. Numerical results confirm the effectiveness of the proposed methods against the original AFDM channel estimation method. Moreover, the resulting AFDM system outperforms OFDM as well as Orthogonal Chirp Division Multiplexing (OCDM) in terms of channel estimation accuracy and bit error rate (BER), which is consistent with our theoretical analysis based on CFR overlap probability (COP), mutual incoherent property (MIP) and channel diversity gain under MSML channels.
Submitted 4 May, 2024;
originally announced May 2024.
-
Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding
Authors:
Yiwen Tang,
Ray Zhang,
Jiaming Liu,
Zoey Guo,
Dong Wang,
Zhigang Wang,
Bin Zhao,
Shanghang Zhang,
Peng Gao,
Hongsheng Li,
Xuelong Li
Abstract:
Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance in widespread scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches are still limited, due to the potential loss of spatial geometries and high computation cost. More importantly, their frameworks are mainly designed for 2D models, lacking a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality. This mechanism enables us to assign each 3D token with a positional encoding paired with the pre-trained model, which avoids 3D geometry loss caused by the true projection and better motivates the transformer for 3D learning with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, compelling the semantic adaption of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method. Code and models are released at https://github.com/Ivan-Tang-3D/Any2Point.
Submitted 30 May, 2024; v1 submitted 11 April, 2024;
originally announced April 2024.
-
CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
Authors:
Leying Zhang,
Yao Qian,
Long Zhou,
Shujie Liu,
Dongmei Wang,
Xiaofei Wang,
Midia Yousefi,
Yanmin Qian,
Jinyu Li,
Lei He,
Sheng Zhao,
Michael Zeng
Abstract:
Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix first converts dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers. These token streams are then fed into a flow-matching based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model. Furthermore, we devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation. Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation. This is exemplified by instances generated in a single channel where one speaker's utterance is seamlessly mixed with another's interjections or laughter, indicating the latter's role as an attentive listener. Audio samples are available at https://aka.ms/covomix.
Submitted 29 May, 2024; v1 submitted 9 April, 2024;
originally announced April 2024.
-
Network-Assisted Full-Duplex Cell-Free mmWave Networks: Hybrid MIMO Processing and Multi-Agent DRL-Based Power Allocation
Authors:
Qingrui Fan,
Yu Zhang,
Jiamin Li,
Dongming Wang,
Hongbiao Zhang,
Xiaohu You
Abstract:
This paper investigates the network-assisted full-duplex (NAFD) cell-free millimeter-wave (mmWave) networks, where the distribution of the transmitting access points (T-APs) and receiving access points (R-APs) across distinct geographical locations mitigates cross-link interference, facilitating the attainment of a truly flexible duplex mode. To curtail deployment expenses and power consumption for mmWave band operations, each AP incorporates a hybrid digital-analog structure encompassing precoder/combiner functions. However, this incorporation introduces processing intricacies within channel estimation and precoding/combining design. In this paper, we first present a hybrid multiple-input multiple-output (MIMO) processing framework and derive explicit expressions for both uplink and downlink achievable rates. Then we formulate a power allocation problem to maximize the weighted bidirectional sum rates. To tackle this non-convex problem, we develop a collaborative multi-agent deep reinforcement learning (MADRL) algorithm called multi-agent twin delayed deep deterministic policy gradient (MATD3) for NAFD cell-free mmWave networks. Specifically, given the tightly coupled nature of both uplink and downlink power coefficients in NAFD cell-free mmWave networks, the MATD3 algorithm resolves such coupled conflicts through an interactive learning process between agents and the environment. Finally, the simulation results validate the effectiveness of the proposed channel estimation methods within our hybrid MIMO processing paradigm, and demonstrate that our MATD3 algorithm outperforms both multi-agent deep deterministic policy gradient (MADDPG) and conventional power allocation strategies.
Submitted 31 March, 2024;
originally announced April 2024.
-
LIBR+: Improving Intraoperative Liver Registration by Learning the Residual of Biomechanics-Based Deformable Registration
Authors:
Dingrong Wang,
Soheil Azadvar,
Jon Heiselman,
Xiajun Jiang,
Michael Miga,
Linwei Wang
Abstract:
The surgical environment imposes unique challenges to the intraoperative registration of organ shapes to their preoperatively-imaged geometry. Biomechanical model-based registration remains popular, while deep learning solutions remain limited due to the sparsity and variability of intraoperative measurements and the limited ground-truth deformation of an organ that can be obtained during the surgery. In this paper, we propose a novel hybrid registration approach that leverages a linearized iterative boundary reconstruction (LIBR) method based on linear elastic biomechanics, and uses deep neural networks to learn its residual to the ground-truth deformation (LIBR+). We further formulate a dual-branch spline-residual graph convolutional neural network (SR-GCN) to assimilate information from sparse and variable intraoperative measurements and effectively propagate it through the geometry of the 3D organ. Experiments on a large intraoperative liver registration dataset demonstrated the consistent improvements achieved by LIBR+ in comparison to existing rigid, biomechanical model-based non-rigid, and deep-learning based non-rigid approaches to intraoperative liver registration.
Submitted 11 March, 2024;
originally announced March 2024.
-
Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR
Authors:
Yufeng Yang,
Ashutosh Pandey,
DeLiang Wang
Abstract:
It has been shown that the intelligibility of noisy speech can be improved by speech enhancement (SE) algorithms. However, monaural SE has not been established as an effective frontend for automatic speech recognition (ASR) in noisy conditions compared to an ASR model trained on noisy speech directly. The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with an ARN (attentive recurrent network) time-domain enhancement model and a CrossNet time-frequency domain enhancement model. The proposed systems fully decouple frontend enhancement and backend ASR trained only on clean speech. Results on the WSJ, CHiME-2, LibriSpeech, and CHiME-4 corpora demonstrate that ARN and CrossNet enhanced speech both translate to improved ASR results in noisy and reverberant environments, and generalize well to real acoustic scenarios. The proposed system outperforms the baselines trained on corrupted speech directly. Furthermore, it cuts the previous best word error rate (WER) on CHiME-2 by a relative $28.4\%$, reaching a $5.57\%$ WER, and achieves $3.32/4.44\%$ WER on single-channel CHiME-4 simulated/real test data without training on CHiME-4.
Submitted 10 March, 2024;
originally announced March 2024.
-
CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation
Authors:
Vahid Ahmadi Kalkhorani,
DeLiang Wang
Abstract:
We introduce CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.
Submitted 5 March, 2024;
originally announced March 2024.
-
Target Localization and Performance Trade-Offs in Cooperative ISAC Systems: A Scheme Based on 5G NR OFDM Signals
Authors:
Zhenkun Zhang,
Hong Ren,
Cunhua Pan,
Sheng Hong,
Dongming Wang,
Jiangzhou Wang,
Xiaohu You
Abstract:
The integration of sensing capabilities into communication systems, by sharing physical resources, has a significant potential for reducing spectrum, hardware, and energy costs while inspiring innovative applications. Cooperative networks, in particular, are expected to enhance sensing services by enlarging the coverage area and enriching sensing measurements, thus improving the service availability and accuracy. This paper proposes a cooperative integrated sensing and communication (ISAC) framework by leveraging information-carrying orthogonal frequency division multiplexing (OFDM) signals transmitted by access points (APs). Specifically, we propose a two-stage scheme for target localization, where communication signals are reused as sensing reference signals based on the system information shared at the central processing unit (CPU). In Stage I, we measure the ranges of scattered paths induced by targets, through the extraction of time-delay information from the received signals at APs. Then, the target locations are estimated in Stage II based on these range measurements. Considering that the scattered paths corresponding to some targets may not be detectable by all APs, we propose an effective algorithm to match the range measurements with the targets and achieve the target location estimation. Notably, by analyzing the OFDM numerologies defined in fifth generation (5G) standards, we elucidate the flexibility and consistency of performance trade-offs in both communication and sensing aspects. Finally, numerical results confirm the effectiveness of our sensing scheme and the cooperative gain of the ISAC framework.
Submitted 4 March, 2024;
originally announced March 2024.
-
Transfer Learning-Enhanced Instantaneous Multi-Person Indoor Localization by CSI
Authors:
Zhiyuan He,
Ke Deng,
Jiangchao Gong,
Yi Zhou,
Desheng Wang
Abstract:
Passive indoor localization, integral to smart buildings, emergency response, and indoor navigation, has traditionally been limited by a focus on single-target localization and reliance on multi-packet CSI. We introduce a novel Multi-target loss, notably enhancing multi-person localization. Utilizing this loss function, our instantaneous CSI-ResNet achieves an impressive 99.21% accuracy at 0.6m precision with single-timestamp CSI. A preprocessing algorithm is implemented to counteract WiFi-induced variability, thereby augmenting robustness. Furthermore, we incorporate Nuclear Norm-Based Transfer Pre-Training, ensuring adaptability in diverse environments, which provides a new paradigm for indoor multi-person localization. Additionally, we have developed an extensive dataset, surpassing existing ones in scope and diversity, to underscore the efficacy of our method and facilitate future fingerprint-based localization research.
Submitted 2 March, 2024;
originally announced March 2024.
-
Analysis of Processing Pipelines for Indoor Human Tracking using FMCW radar
Authors:
Dingyang Wang,
Francesco Fioranelli,
Alexander Yarovoy
Abstract:
In this paper, the problem of formulating effective processing pipelines for indoor human tracking is investigated, with the usage of a Multiple Input Multiple Output (MIMO) Frequency Modulated Continuous Wave (FMCW) radar. Specifically, two processing pipelines starting with detections on the Range-Azimuth (RA) maps and the Range-Doppler (RD) maps are formulated and compared, together with subsequent clustering and tracking algorithms and their relevant parameters. Experimental results are presented to validate and assess both pipelines, using a 24 GHz commercial radar platform with 250 MHz bandwidth and 15 virtual channels. Scenarios where 1 and 2 people move in an indoor environment are considered, and the influence of the number of virtual channels and detectors' parameters is discussed. The characteristics and limitations of both pipelines are presented, with the approach based on detections on RA maps showing in general more robust results.
Submitted 15 March, 2024; v1 submitted 29 February, 2024;
originally announced February 2024.
-
Hy-DAT: A Tool to Address Hydropower Modeling Gaps Using Interdependency, Efficiency Curves, and Unit Dispatch Models
Authors:
Dewei Wang,
Bhaskar Mitra,
Sameer Nekkalapu,
Sohom Datta,
Bibi Matthew,
Rounak Meyur,
Heng Wang,
Slaven Kincic
Abstract:
As the power system continues to be flooded with intermittent resources, it becomes more important to accurately assess the role of hydro and its impact on the power grid. While hydropower generation has been studied for decades, dependency of power generation on water availability and constraints in hydro operation are not well represented in power system models used in the planning and operation of large-scale interconnection studies. There are still multiple modeling gaps that need to be addressed; if not, they can lead to inaccurate operation and planning reliability studies, and consequently to unintentional load shedding or even blackouts. As a result, it is very important that hydropower is represented correctly in both steady-state and dynamic power system studies. In this paper, we discuss the development and use of the Hydrological Dispatch and Analysis Tool (Hy-DAT) as an interactive graphical user interface, that uses a novel methodology to address the hydropower modeling gaps like water availability and interdependency using a database and algorithms to generate accurate representative models for power system simulation.
Submitted 5 March, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
Online Mean Estimation for Multi-frame Optical Fiber Signals On Highways
Authors:
Linlin Wang,
Mingxue Quan,
Wei Wang,
Dezhao Wang,
Shanwen Wang
Abstract:
In the era of Big Data, prompt analysis and processing of data sets is critical. Meanwhile, statistical methods provide key tools and techniques to extract valuable insights and knowledge from complex data sets. This paper applies statistical methods to the field of traffic, particularly focusing on the preprocessing of multi-frame signals obtained by an optical fiber-based Distributed Acoustic Sensing (DAS) system. An online non-parametric regression model based on Local Polynomial Regression (LPR) and variable bandwidth selection is employed to dynamically update the estimate of the mean function as signals flow in. This mean estimation method can derive average information of multi-frame fiber signals, thus providing the basis for the subsequent vehicle trajectory extraction algorithms. To further evaluate the effectiveness of the proposed method, comparison experiments were conducted under real highway scenarios, showing that our approach not only deals with multi-frame signals more accurately than the classical filter-based Kalman and Wavelet methods, but also better meets the requirements of low memory use and rapid response. It provides a new reliable means for signal processing which can be integrated with other existing methods.
Submitted 22 May, 2024; v1 submitted 20 January, 2024;
originally announced February 2024.
-
Traffic Flow and Speed Monitoring Based On Optical Fiber Distributed Acoustic Sensor
Authors:
Linlin Wang,
Shixin Wang,
Peng Wang,
Wei Wang,
Dezhao Wang,
Yongcai Wang,
Shanwen Wang
Abstract:
In the realm of intelligent transportation systems, accurate and reliable traffic monitoring is crucial. Traditional devices, such as cameras and lidars, face limitations in adverse weather conditions and complex traffic scenarios, prompting the need for more resilient technologies. This paper presents a traffic flow monitoring method using optical fiber-based Distributed Acoustic Sensors (DAS). An innovative vehicle trajectory extraction algorithm is proposed to derive traffic flow statistics. In the processing of optical fiber waterfall diagrams, a Butterworth low-pass filter and a peak location search method are employed to determine the entry position of vehicles. Subsequently, a line-by-line matching algorithm is proposed to effectively track the trajectories. Experiments were conducted under both real highway and tunnel scenarios, showing that our approach not only extracts vehicle trajectories more accurately than the classical Hough and Radon transform-based methods, but also facilitates the calculation of traffic flow information using low-cost acoustic sensors. It provides a new reliable means for traffic flow monitoring which can be integrated with existing methods.
Submitted 20 January, 2024;
originally announced February 2024.
-
How phonemes contribute to deep speaker models?
Authors:
Pengqi Li,
Tianhao Wang,
Lantian Li,
Askar Hamdulla,
Dong Wang
Abstract:
Which phonemes convey more speaker traits is a long-standing question, and various perception experiments have been conducted with human subjects. For speaker recognition, studies were conducted with conventional statistical models, and the conclusions drawn are more or less consistent with the perception results. However, which phonemes are more important for modern deep neural models is still unexplored, due to the opaqueness of the decision process. This paper conducts a novel study of phoneme attribution with two types of deep speaker models, based on TDNN and CNN respectively, from the perspective of model explanation. Specifically, we conducted the study with two post-explanation methods: LayerCAM and Time Align Occlusion (TAO). Experimental results showed that: (1) At the population level, vowels are more important than consonants, confirming the human perception studies. However, fricatives are among the least important phonemes, which contrasts with previous studies. (2) At the speaker level, a large between-speaker variation is observed regarding phoneme importance, indicating that whether a phoneme is important or not is largely speaker-dependent.
Submitted 5 February, 2024;
originally announced February 2024.
-
Adversarial Data Augmentation for Robust Speaker Verification
Authors:
Zhenyu Zhou,
Junhui Chen,
Namin Wang,
Lantian Li,
Dong Wang
Abstract:
Data augmentation (DA) has gained widespread popularity in deep speaker models due to its ease of implementation and significant effectiveness. It enriches training data by simulating real-life acoustic variations, enabling deep neural networks to learn speaker-related representations while disregarding irrelevant acoustic variations, thereby improving robustness and generalization. However, a potential issue with the vanilla DA is augmentation residual, i.e., unwanted distortion caused by different types of augmentation. To address this problem, this paper proposes a novel approach called adversarial data augmentation (A-DA) which combines DA with adversarial learning. Specifically, it involves an additional augmentation classifier to categorize various augmentation types used in data augmentation. This adversarial learning empowers the network to generate speaker embeddings that can deceive the augmentation classifier, making the learned speaker embeddings more robust in the face of augmentation variations. Experiments conducted on VoxCeleb and CN-Celeb datasets demonstrate that our proposed A-DA outperforms standard DA in both augmentation matched and mismatched test conditions, showcasing its superior robustness and generalization against acoustic variations.
Submitted 4 February, 2024;
originally announced February 2024.
-
Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction
Authors:
Xueyuan Chen,
Yuejiao Wang,
Xixin Wu,
Disong Wang,
Zhiyong Wu,
Xunying Liu,
Helen Meng
Abstract:
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech by improving the intelligibility and naturalness. This is a challenging task, especially for patients with severe dysarthria and those speaking in complex, noisy acoustic environments. To address these challenges, we propose a novel multi-modal framework to utilize visual information, e.g., lip movements, in DSR as extra clues for reconstructing the highly abnormal pronunciations. The multi-modal framework consists of: (i) a multi-modal encoder to extract robust phoneme embeddings from dysarthric speech with auxiliary visual features; (ii) a variance adaptor to infer the normal phoneme duration and pitch contour from the extracted phoneme embeddings; (iii) a speaker encoder to encode the speaker's voice characteristics; and (iv) a mel-decoder to generate the reconstructed mel-spectrogram based on the extracted phoneme embeddings, prosodic features and speaker embeddings. Both objective and subjective evaluations conducted on the commonly used UASpeech corpus show that our proposed approach can achieve significant improvements over baseline systems in terms of speech intelligibility and naturalness, especially for the speakers with more severe symptoms. Compared with original dysarthric speech, the reconstructed speech achieves 42.1% absolute word error rate reduction for patients with more severe dysarthria levels.
Submitted 31 January, 2024;
originally announced January 2024.
-
UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization
Authors:
Yuejiao Wang,
Xixin Wu,
Disong Wang,
Lingwei Meng,
Helen Meng
Abstract:
Dysarthric speech reconstruction (DSR) systems aim to automatically convert dysarthric speech into normal-sounding speech. The technology eases communication with speakers affected by the neuromotor disorder and enhances their social inclusion. NED-based (Neural Encoder-Decoder) systems have significantly improved the intelligibility of the reconstructed speech as compared with GAN-based (Generative Adversarial Network) approaches, but the approach is still limited by training inefficiency caused by the cascaded pipeline and auxiliary tasks of the content encoder, which may in turn affect the quality of reconstruction. Inspired by self-supervised speech representation learning and discrete speech units, we propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement and utilizes speech units to constrain the dysarthric content restoration in a discrete linguistic space. Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks. Results on the UASpeech corpus indicate that Unit-DSR outperforms competitive baselines in terms of content restoration, reaching a 28.2% relative average word error rate reduction when compared to original dysarthric speech, and shows robustness against speed perturbation and noise.
Submitted 26 January, 2024;
originally announced January 2024.
-
Combined Generative and Predictive Modeling for Speech Super-resolution
Authors:
Heming Wang,
Eric W. Healy,
DeLiang Wang
Abstract:
Speech super-resolution (SR) is the task that restores high-resolution speech from low-resolution input. Existing models employ simulated data and constrained experimental settings, which limit generalization to real-world SR. Predictive models are known to perform well in fixed experimental settings, but can introduce artifacts in adverse conditions. On the other hand, generative models learn the distribution of target data and have a better capacity to perform well on unseen conditions. In this study, we propose a novel two-stage approach that combines the strengths of predictive and generative models. Specifically, we employ a diffusion-based model that is conditioned on the output of a predictive model. Our experiments demonstrate that the model significantly outperforms single-stage counterparts and existing strong baselines on benchmark SR datasets. Furthermore, we introduce a repainting technique during the inference of the diffusion process, enabling the proposed model to regenerate high-frequency components even in mismatched conditions. An additional contribution is the collection of and evaluation on real SR recordings, using the same microphone at different native sampling rates. We make this dataset freely accessible, to accelerate progress towards real-world speech super-resolution.
Submitted 25 January, 2024;
originally announced January 2024.
-
RIS-Enabled Integrated Sensing and Communication for 6G Systems
Authors:
Dexin Wang,
Ahmad Bazzi,
Marwa Chafii
Abstract:
The following paper proposes a new target localization system design using an architecture based on reconfigurable intelligent surfaces (RISs) and passive radars (PRs) for integrated sensing and communications systems. The preamble of the communication signal is exploited in order to perform target sensing tasks, which involve detection and localization. The RIS in this case can aid the PR in sensing targets that are otherwise not seen by the PR itself, due to the many obstacles encountered within the propagation channel. Therefore, this work proposes a localization algorithm tailored for the integrated sensing and communications RIS-aided architecture, which is capable of uniquely positioning targets within the scene. The algorithm is capable of detecting the number of targets along with estimating the position of targets via angles and times of arrival. Our simulation results demonstrate the performance of the localization method in terms of different localization and detection metrics and for increasing RIS sizes.
Submitted 31 December, 2023;
originally announced January 2024.
-
Implementing Digital Twin in Field-Deployed Optical Networks: Uncertain Factors, Operational Guidance, and Field-Trial Demonstration
Authors:
Yuchen Song,
Min Zhang,
Yao Zhang,
Yan Shi,
Shikui Shen,
Bingli Guo,
Shanguo Huang,
Danshi Wang
Abstract:
Digital twin has revolutionized optical communication networks by enabling their full life-cycle management, including design, troubleshooting, optimization, upgrade, and prediction. While extensive literature exists on frameworks, standards, and applications of digital twin, there is a pressing need to implement digital twin in field-deployed optical networks operating in real-world environments, as opposed to controlled laboratory settings. This paper addresses this challenge by examining the uncertain factors behind the inaccuracy of digital twin in field-deployed optical networks from three main challenges and proposing operational guidance for implementing accurate digital twin in field-deployed optical networks. Through the proposed guidance, we demonstrate the effective implementation of digital twin in a field-trial C+L-band optical transmission link, showcasing its capabilities in performance recovery in a fiber cut scenario.
Submitted 6 December, 2023;
originally announced December 2023.
-
Leveraging Laryngograph Data for Robust Voicing Detection in Speech
Authors:
Yixuan Zhang,
Heming Wang,
DeLiang Wang
Abstract:
Accurately detecting voiced intervals in speech signals is a critical step in pitch tracking and has numerous applications. While conventional signal processing methods and deep learning algorithms have been proposed for this task, their need to fine-tune threshold parameters for different datasets and limited generalization restrict their utility in real-world applications. To address these challenges, this study proposes a supervised voicing detection model that leverages recorded laryngograph data. The model is based on a densely-connected convolutional recurrent neural network (DC-CRN), and trained on data with reference voicing decisions extracted from laryngograph data sets. Pretraining is also investigated to improve the generalization ability of the model. The proposed model produces robust voicing detection results, outperforming other strong baseline methods, and generalizes well to unseen datasets. The source code of the proposed model with pretraining is provided along with the list of used laryngograph datasets to facilitate further research in this area.
Submitted 5 December, 2023;
originally announced December 2023.
-
Combating Multi-path Interference to Improve Chirp-based Underwater Acoustic Communication
Authors:
Wenjun Xie,
Enqi Zhang,
Lizhao You,
Deqing Wang,
Zhaorui Wang,
Liqun Fu
Abstract:
Linear chirp-based underwater acoustic communication has been widely used due to its reliability and long-range transmission capability. However, unlike its counterpart chirp technology in wireless, LoRa, its throughput is severely limited by the number of modulated chirps in a symbol. The fundamental challenge lies in the underwater multi-path channel, where the delayed copies of one symbol may cause inter-symbol and intra-symbol interference. In this paper, we present UWLoRa+, a system that realizes the same chirp modulation as LoRa with a higher data rate, and enhances LoRa's design to address the multi-path challenge via the following designs: a) we replace the linear chirp used by LoRa with a non-linear chirp to reduce the signal interference range and the collision probability; b) we design an algorithm that first demodulates each path and then combines the demodulation results of detected paths; and c) we replace the Hamming codes used by LoRa with non-binary LDPC codes to mitigate the impact of the inevitable collisions. Experimental results show that the new designs improve the bit error rate (BER) by 3x, and the packet error rate (PER) significantly, compared with LoRa's naive design. Compared with a state-of-the-art system for decoding underwater LoRa chirp signals, UWLoRa+ improves the throughput by up to 50 times.
Submitted 29 November, 2023;
originally announced November 2023.
-
A Short Overview of 6G V2X Communication Standards
Authors:
Donglin Wang,
Yann Nana Nganso,
Hans D. Schotten
Abstract:
We are on the verge of a new age of linked autonomous cars with unheard-of user experiences, dramatically improved air quality and road safety, extremely varied transportation settings, and a plethora of cutting-edge apps. A substantially improved Vehicle-to-Everything (V2X) communication network that can simultaneously support massive hyper-fast, ultra-reliable, and low-latency information exchange is necessary to achieve this ambitious goal. These needs of the upcoming V2X are expected to be satisfied by the Sixth Generation (6G) communication system. In this article, we start by introducing the history of V2X communications by giving details on the current, developing, and future developments. We compare the applications of communication technologies such as Wi-Fi, LTE, 5G, and 6G. We focus on the new technologies for 6G V2X, which are brain-vehicle interface, blocked-based V2X, and Machine Learning (ML). To achieve this, we provide a summary of the most recent ML developments in 6G vehicle networks. We discuss the security challenges of 6G V2X. We address the strengths, open challenges, development, and areas for further study in this field.
Submitted 28 November, 2023;
originally announced November 2023.
-
A Fast Power Spectrum Sensing Solution for Generalized Coprime Sampling
Authors:
Kaili Jiang,
Dechang Wang,
Kailun Tian,
Hancong Feng,
Yuxin Zhao,
Junyu Yuan,
Bin Tang
Abstract:
With the growing scarcity of spectrum resources, wideband spectrum sensing is required to process a prohibitive volume of data at a high sampling rate. For some applications, spectrum estimation only requires second-order statistics. In this case, a fast power spectrum sensing solution is proposed based on the generalized coprime sampling. By exploring the inherent structure of the sensing vector, the autocorrelation sequence of inputs can be reconstructed from sub-Nyquist samples by only utilizing the parallel Fourier transform and simple multiplication operations. Thus, it takes less time than the state-of-the-art methods while maintaining the same performance, and it achieves higher performance than the existing methods within the same execution time, without the need for pre-estimating the number of inputs. Furthermore, model mismatch has only a minor impact on the estimation performance, which allows for more efficient use of the spectrum resource in a distributed swarm scenario. Simulation results demonstrate the low complexity in sampling and computation, making it a more practical solution for real-time and distributed wideband spectrum sensing applications.
Submitted 22 November, 2023;
originally announced November 2023.
-
Physics-driven generative adversarial networks empower single-pixel infrared hyperspectral imaging
Authors:
Dong-Yin Wang,
Shu-Hang Bie,
Xi-Hao Chen,
Wen-Kai Yu
Abstract:
A physics-driven generative adversarial network (GAN) was established here for single-pixel hyperspectral imaging (HSI) in the infrared spectrum, to eliminate the extensive data training work required by traditional data-driven models. Within the GAN framework, the physical process of single-pixel imaging (SPI) was integrated into the generator, and the actual and estimated one-dimensional (1D) bucket signals were employed as constraints in the objective function to update the network's parameters and optimize the generator with the assistance of the discriminator. In comparison to single-pixel infrared HSI methods based on compressed sensing and physics-driven convolutional neural networks, our physics-driven GAN-based single-pixel infrared HSI can achieve higher imaging performance with fewer measurements. We believe that this physics-driven GAN will promote practical applications of computational imaging, especially various SPI-based techniques.
Submitted 22 November, 2023;
originally announced November 2023.
-
Multi-channel Conversational Speaker Separation via Neural Diarization
Authors:
Hassan Taherian,
DeLiang Wang
Abstract:
When dealing with overlapped speech, the performance of automatic speech recognition (ASR) systems substantially degrades as they are designed for single-talker speech. To enhance ASR performance in conversational or meeting environments, continuous speaker separation (CSS) is commonly employed. However, CSS requires a short separation window to avoid many speakers inside the window and sequential grouping of discontinuous speech segments. To address these limitations, we introduce a new multi-channel framework called "speaker separation via neural diarization" (SSND) for meeting environments. Our approach utilizes an end-to-end diarization system to identify the speech activity of each individual speaker. By leveraging estimated speaker boundaries, we generate a sequence of embeddings, which in turn facilitate the assignment of speakers to the outputs of a multi-talker separation model. SSND addresses the permutation ambiguity issue of talker-independent speaker separation during the diarization phase through location-based training, rather than during the separation process. This unique approach allows multiple non-overlapped speakers to be assigned to the same output stream, making it possible to efficiently process long segments, a task impossible with CSS. Additionally, SSND is naturally suitable for speaker-attributed ASR. We evaluate our proposed diarization and separation methods on the open LibriCSS dataset, advancing state-of-the-art diarization and ASR results by a large margin.
Submitted 14 November, 2023;
originally announced November 2023.
-
Synchrophasor Data Anomaly Detection on Grid Edge by 5G Communication and Adjacent Compute
Authors:
Chuan Qin,
Dexin Wang,
Kishan Prudhvi Guddanti,
Xiaoyuan Fan,
Zhangshuan Hou
Abstract:
The fifth-generation mobile communication (5G) technology offers opportunities to enhance the real-time monitoring of grids. The 5G-enabled phasor measurement units (PMUs) feature flexible positioning and cost-effective long-term maintenance without the constraints of fixed wiring. This paper is the first to demonstrate the applicability of 5G in PMU communication, and the experiment was carried out on the Verizon non-standalone test-bed at the Pacific Northwest National Laboratory (PNNL) Advanced Wireless Communication lab. The performance of the 5G-enabled PMU communication setup is reviewed and discussed in this paper, and a generalized dynamic linear model (GDLM) based real-time synchrophasor data anomaly detection use-case is presented. Finally, the practicability of implementing 5G for wide-area protection strategies is explored and discussed by analyzing the experimental results.
Submitted 13 November, 2023;
originally announced November 2023.
-
Passive Integrated Sensing and Communication Scheme based on RF Fingerprint Information Extraction for Cell-Free RAN
Authors:
Jingxuan Yu,
Fan Zeng,
Jiamin Li,
Feiyang Liu,
Pengcheng Zhu,
Dongming Wang,
Xiaohu You
Abstract:
This paper investigates how to achieve integrated sensing and communication (ISAC) based on a cell-free radio access network (CF-RAN) architecture with a minimum footprint of communication resources. We propose a new passive sensing scheme. The scheme is based on the radio frequency (RF) fingerprint learning of the RF radio unit (RRU) to build an RF fingerprint library of RRUs. The source RRU is identified by comparing the RF fingerprints carried by the signal at the receiver side. The receiver extracts the channel parameters from the signal and estimates the channel environment, thus locating the reflectors in the environment. The proposed scheme can effectively solve the problem of interference between signals in the same time-frequency domain but in different spatial domains when multiple RRUs jointly serve users in CF-RAN architecture. Simulation results show that the proposed passive ISAC scheme can effectively detect reflector location information in the environment without degrading the communication performance.
Submitted 10 November, 2023;
originally announced November 2023.
-
Integrated Sensing and Communication for Network-Assisted Full-Duplex Cell-Free Distributed Massive MIMO Systems
Authors:
Fan Zeng,
Jingxuan Yu,
Jiamin Li,
Feiyang Liu,
Dongming Wang,
Xiaohu You
Abstract:
In this paper, we combine the network-assisted full-duplex (NAFD) technology and distributed radar sensing to implement integrated sensing and communication (ISAC). The ISAC system features both uplink and downlink remote radio units (RRUs) equipped with communication and sensing capabilities. We evaluate the communication and sensing performance of the system using the sum communication rates and the Cramer-Rao lower bound (CRLB), respectively. We compare the performance of the proposed scheme with other ISAC schemes; the results show that the proposed scheme can provide more stable sensing and better communication performance. Furthermore, we propose two power allocation algorithms to optimize the communication and sensing performance jointly. One algorithm is based on the deep Q-network (DQN) and the other one is based on the non-dominated sorting genetic algorithm II (NSGA-II). The proposed algorithms provide more feasible solutions and achieve better system performance than the equal power allocation algorithm.
Submitted 13 November, 2023; v1 submitted 8 November, 2023;
originally announced November 2023.
-
LoMAE: Low-level Vision Masked Autoencoders for Low-dose CT Denoising
Authors:
Dayang Wang,
Yongshun Xu,
Shuo Han,
Zhan Wu,
Li Zhou,
Bahareh Morovati,
Hengyong Yu
Abstract:
Low-dose computed tomography (LDCT) offers reduced X-ray radiation exposure but at the cost of compromised image quality, characterized by increased noise and artifacts. Recently, transformer models emerged as a promising avenue to enhance LDCT image quality. However, the success of such models relies on a large amount of paired noisy and clean images, which are often scarce in clinical settings. In the fields of computer vision and natural language processing, masked autoencoders (MAE) have been recognized as an effective label-free self-pretraining method for transformers, due to their exceptional feature representation ability. However, the original pretraining and fine-tuning design fails to work in low-level vision tasks like denoising. In response to this challenge, we redesign the classical encoder-decoder learning model and facilitate a simple yet effective low-level vision MAE, referred to as LoMAE, tailored to address the LDCT denoising problem. Moreover, we introduce an MAE-GradCAM method to shed light on the latent learning mechanisms of the MAE/LoMAE. Additionally, we explore the LoMAE's robustness and generalizability across a variety of noise levels. Experimental results show that the proposed LoMAE can enhance the transformer's denoising performance and greatly relieve the dependence on the ground truth clean data. It also demonstrates remarkable robustness and generalizability over a spectrum of noise levels.
Submitted 18 October, 2023;
originally announced October 2023.
-
A Glance is Enough: Extract Target Sentence By Looking at A keyword
Authors:
Ying Shi,
Dong Wang,
Lantian Li,
Jiqing Han
Abstract:
This paper investigates the possibility of extracting a target sentence from multi-talker speech using only a keyword as input. For example, in social security applications, the keyword might be "help", and the goal is to identify what the person who called for help is articulating while ignoring other speakers. To address this problem, we propose using the Transformer architecture to embed both the keyword and the speech utterance and then rely on the cross-attention mechanism to select the correct content from the concatenated or overlapping speech. Experimental results on Librispeech demonstrate that our proposed method can effectively extract target sentences from very noisy and mixed speech (SNR=-3dB), achieving a phone error rate (PER) of 26%, compared to the baseline system's PER of 96%.
Submitted 8 October, 2023;
originally announced October 2023.