-
Prioritized experience replay-based DDQN for Unmanned Vehicle Path Planning
Authors:
Liu Lipeng,
Letian Xu,
Jiabei Liu,
Haopeng Zhao,
Tongzhou Jiang,
Tianyao Zheng
Abstract:
The path planning module is a key component of autonomous vehicle navigation and directly affects operating efficiency and safety. In complex environments with many obstacles, traditional planning algorithms often cannot meet the needs of intelligent navigation, which may cause unmanned vehicles to become stuck in dead zones. This paper proposes a path planning algorithm based on DDQN and combines it with prioritized experience replay to address the tendency of traditional path planning algorithms to fall into dead zones. Simulation results show that the DDQN-based path planning algorithm is significantly better than other methods in terms of speed and accuracy, especially in its ability to break out of dead zones in extreme environments. The results also show that the algorithm performs well in terms of path quality and safety. These findings provide an important reference for research on automatic navigation of autonomous vehicles.
Submitted 25 June, 2024;
originally announced June 2024.
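The abstract above combines DDQN with prioritized experience replay but gives no implementation details. The following is a minimal sketch of the two standard ingredients as they are usually defined, not the authors' code: the Double DQN target (the online network selects the next action, the target network evaluates it) and proportional prioritized sampling. All function names and the values of `gamma`, `alpha`, `beta`, and `eps` are illustrative assumptions.

```python
import numpy as np


def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double DQN bootstrap targets for a batch of transitions.

    q_online_next / q_target_next: arrays of shape (batch, n_actions) holding
    Q-values of the next states from the online and target networks.
    """
    # The online network picks the greedy next action...
    best_actions = np.argmax(q_online_next, axis=1)
    # ...and the target network evaluates that action.
    next_values = q_target_next[np.arange(len(rewards)), best_actions]
    # Terminal transitions (dones == 1) do not bootstrap.
    return rewards + gamma * (1.0 - dones) * next_values


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization: P(i) = p_i^alpha / sum_k p_k^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()


def per_importance_weights(probs, beta=0.4):
    """Importance-sampling weights that correct the non-uniform sampling bias."""
    weights = (len(probs) * probs) ** (-beta)
    # Normalize by the maximum so weights only scale updates downward.
    return weights / weights.max()
```

Transitions with larger temporal-difference error are replayed more often, and the importance weights scale their updates down to compensate.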
-
Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction
Authors:
Haoqiu Yan,
Yongxin Zhu,
Kai Zheng,
Bing Liu,
Haoyu Cao,
Deqiang Jiang,
Linli Xu
Abstract:
Large Language Model (LLM)-enhanced agents are becoming increasingly prevalent in human-AI communication, offering vast potential from entertainment to professional domains. However, current multi-modal dialogue systems overlook the acoustic information present in speech, which is crucial for understanding the nuances of human communication. This oversight can lead to misinterpretations of speakers' intentions, resulting in inconsistent or even contradictory responses within dialogues. To bridge this gap, we propose PerceptiveAgent, an empathetic multi-modal dialogue system designed to discern deeper or more subtle meanings beyond the literal interpretations of words through the integration of speech modality perception. Employing LLMs as a cognitive core, PerceptiveAgent perceives acoustic information from input speech and generates empathetic responses based on speaking styles described in natural language. Experimental results indicate that PerceptiveAgent excels in contextual understanding by accurately discerning the speakers' true intentions in scenarios where the linguistic meaning is either contrary to or inconsistent with the speaker's true feelings, producing more nuanced and expressive spoken dialogues. Code is publicly available at https://github.com/Haoqiu-Yan/PerceptiveAgent.
Submitted 18 June, 2024;
originally announced June 2024.
-
Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation
Authors:
Hanzhao Li,
Liumeng Xue,
Haohan Guo,
Xinfa Zhu,
Yuanjun Lv,
Lei Xie,
Yunlin Chen,
Hao Yin,
Zhifei Li
Abstract:
The multi-codebook speech codec enables the application of large language models (LLMs) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically-rich discrete sequence. Furthermore, the encoder is enhanced with 1) contextual modeling with a BLSTM module to exploit the temporal information, 2) a hybrid sampling module to alleviate distortion from upsampling and downsampling, and 3) a resampling module to encourage discrete units to carry more phonetic information. Compared with multi-codebook codecs, e.g., EnCodec and TiCodec, Single-Codec demonstrates higher reconstruction quality with a lower bandwidth of only 304 bps. The effectiveness of Single-Codec is further validated by LLM-TTS experiments, showing improved naturalness and intelligibility.
Submitted 11 June, 2024;
originally announced June 2024.
-
WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark
Authors:
Linhan Ma,
Dake Guo,
Kun Song,
Yuepeng Jiang,
Shuai Wang,
Liumeng Xue,
Weiming Xu,
Huan Zhao,
Binbin Zhang,
Lei Xie
Abstract:
With the development of large text-to-speech (TTS) models and the scale-up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. To tailor it for text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and a quality-based data filtering process, the resulting WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores, to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines on the benchmark for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on Hugging Face.
Submitted 19 June, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
Text-aware and Context-aware Expressive Audiobook Speech Synthesis
Authors:
Dake Guo,
Xinfa Zhu,
Liumeng Xue,
Yongmao Zhang,
Wenjie Tian,
Lei Xie
Abstract:
Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manually labeled data or reference speech. To address this problem, we propose a text-aware and context-aware (TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space to cover diverse styles via contrastive learning with the supervision of the speech style. Meanwhile, we adopt a context encoder to incorporate cross-sentence information and the style embedding obtained from text. Finally, we introduce the context encoder to two typical TTS models, VITS-based TTS and language model-based TTS. Experimental results demonstrate that our proposed approach can effectively capture diverse styles and coherent prosody, and consequently improves naturalness and expressiveness in audiobook speech synthesis.
Submitted 12 June, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
Joint Association, Beamforming, and Resource Allocation for Multi-IRS Enabled MU-MISO Systems With RSMA
Authors:
Chunjie Wang,
Xuhui Zhang,
Huijun Xing,
Liang Xue,
Shuqiang Wang,
Yanyan Shen,
Bo Yang,
** Guan
Abstract:
Intelligent reflecting surface (IRS) and rate-splitting multiple access (RSMA) technologies are at the forefront of enhancing spectrum and energy efficiency in next-generation multi-antenna communication systems. This paper explores an RSMA system with multiple IRSs and proposes two purpose-driven scheduling schemes, i.e., the exhaustive IRS-aided (EIA) and opportunistic IRS-aided (OIA) schemes. The aim is to optimize the system weighted energy efficiency (EE) under each of the two schemes. Specifically, the Dinkelbach, branch and bound, successive convex approximation, and semidefinite relaxation methods are exploited within the alternating optimization framework to obtain effective solutions to the considered problems. The numerical findings indicate that the EIA scheme outperforms the OIA scheme in diverse scenarios in terms of weighted EE, and that the proposed algorithm outperforms the baseline algorithms.
Submitted 5 June, 2024;
originally announced June 2024.
-
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer
Authors:
Yongxin Zhu,
Dan Su,
Liqiang He,
Linli Xu,
Dong Yu
Abstract:
While recent advancements in speech language models have achieved significant progress, they face remarkable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce Generative Pre-trained Speech Transformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture, allowing for a unified one-stage generation process and enhancing Hi-Res audio generation capabilities. By training on large corpora of speech in an end-to-end unsupervised manner, GPST can generate syntactically consistent speech with diverse speaker identities. Given a brief 3-second prompt, GPST can produce natural and coherent personalized speech, demonstrating in-context learning abilities. Moreover, our approach can be easily extended to spoken cross-lingual speech generation by incorporating multi-lingual semantic tokens and universal acoustic tokens. Experimental results indicate that GPST significantly outperforms the existing speech language models in terms of word error rate, speech quality, and speaker similarity. See https://youngsheen.github.io/GPST/demo for demo samples.
Submitted 3 June, 2024;
originally announced June 2024.
-
Enhanced Automotive Radar Collaborative Sensing By Exploiting Constructive Interference
Authors:
Lifan Xu,
Shunqiao Sun,
A. Lee Swindlehurst
Abstract:
Automotive radar has emerged as a crucial sensor for autonomous vehicle perception. As more cars are equipped with radars, radar interference becomes an unavoidable challenge. Unlike conventional approaches such as interference mitigation and interference-avoiding technologies, this paper introduces an innovative collaborative sensing scheme with multiple automotive radars that exploits constructive interference. Through collaborative sensing, our method optimally aligns cross-path interference signals from other radars with another radar's self-echo signals, thereby significantly augmenting its target detection capabilities. This approach alleviates the need for extensive raw data sharing between collaborating radars. Instead, only an optimized weighting matrix needs to be exchanged between the radars. This considerably decreases the data bandwidth requirements for the wireless channel, making it a more feasible and practical solution for automotive radar collaboration. Numerical results demonstrate the effectiveness of the constructive interference approach for enhanced object detection capability.
Submitted 27 May, 2024;
originally announced May 2024.
-
MRSegmentator: Robust Multi-Modality Segmentation of 40 Classes in MRI and CT Sequences
Authors:
Hartmut Häntze,
Lina Xu,
Felix J. Dorfner,
Leonhard Donle,
Daniel Truhn,
Hugo Aerts,
Mathias Prokop,
Bram van Ginneken,
Alessa Hering,
Lisa C. Adams,
Keno K. Bressem
Abstract:
Purpose: To introduce a deep learning model capable of multi-organ segmentation in MRI scans, offering a solution to the current limitations in MRI analysis due to challenges in resolution, standardized intensity values, and variability in sequences.
Materials and Methods: The model was trained on 1,200 manually annotated MRI scans from the UK Biobank, 221 in-house MRI scans and 1,228 CT scans, leveraging cross-modality transfer learning from CT segmentation models. A human-in-the-loop annotation workflow was employed to efficiently create high-quality segmentations. The model's performance was evaluated on NAKO and the AMOS22 dataset containing 600 and 60 MRI examinations. Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) were used to assess segmentation accuracy. The model will be open sourced.
Results: The model showcased high accuracy in segmenting well-defined organs, achieving Dice Similarity Coefficient (DSC) scores of 0.97 for the right and left lungs, and 0.95 for the heart. It also demonstrated robustness in organs like the liver (DSC: 0.96) and kidneys (DSC: 0.95 left, 0.95 right), which present more variability. However, segmentation of smaller and complex structures such as the portal and splenic veins (DSC: 0.54) and adrenal glands (DSC: 0.65 left, 0.61 right) revealed the need for further model optimization.
Conclusion: The proposed model is a robust tool for accurate segmentation of 40 anatomical structures in MRI and CT images. By leveraging cross-modality learning and interactive annotation, the model achieves strong performance and generalizability across diverse datasets, making it a valuable resource for researchers and clinicians. It is open source and can be downloaded from https://github.com/hhaentze/MRSegmentator.
Submitted 13 May, 2024; v1 submitted 10 May, 2024;
originally announced May 2024.
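The MRSegmentator abstract above reports its results as Dice Similarity Coefficient (DSC) scores. As a reminder of what that number measures, here is a minimal sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), for two binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    overlap = np.logical_and(pred, truth).sum()
    return 2.0 * overlap / denom
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they do not overlap at all, which is why small structures such as veins tend to score lower: a few misplaced voxels remove a large share of the overlap.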
-
Shifting the ISAC Trade-Off with Fluid Antenna Systems
Authors:
Jiaqi Zou,
Hao Xu,
Chao Wang,
Lvxin Xu,
Songlin Sun,
Kaitao Meng,
Christos Masouros,
Kai-Kit Wong
Abstract:
As an emerging antenna technology, a fluid antenna system (FAS) enhances spatial diversity to improve both sensing and communication performance by shifting the active antennas among available ports. In this letter, we study the potential of shifting the integrated sensing and communication (ISAC) trade-off with FAS. We propose the model for FAS-enabled ISAC and jointly optimize the transmit beamforming and port selection of FAS. In particular, we aim to minimize the transmit power, while satisfying both communication and sensing requirements. An efficient iterative algorithm based on sparse optimization, convex approximation, and a penalty approach is developed. The simulation results show that the proposed scheme can attain 33% reductions in transmit power with guaranteed sensing and communication performance, showing the great potential of the fluid antenna for striking a flexible tradeoff between sensing and communication in ISAC systems.
Submitted 9 May, 2024;
originally announced May 2024.
-
Improve Cross-Modality Segmentation by Treating MRI Images as Inverted CT Scans
Authors:
Hartmut Häntze,
Lina Xu,
Leonhard Donle,
Felix J. Dorfner,
Alessa Hering,
Lisa C. Adams,
Keno K. Bressem
Abstract:
Computed tomography (CT) segmentation models frequently include classes that are not currently supported by magnetic resonance imaging (MRI) segmentation models. In this study, we show that a simple image inversion technique can significantly improve the segmentation quality of CT segmentation models on MRI data, using the TotalSegmentator model applied to T1-weighted MRI images as an example. Image inversion is straightforward to implement and does not require dedicated graphics processing units (GPUs), thus providing a quick alternative to complex deep modality-transfer models for generating segmentation masks for MRI data.
Submitted 4 May, 2024;
originally announced May 2024.
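The abstract above describes image inversion as straightforward to implement but does not state the exact formula. One plausible reading, shown here purely as an assumption, is a min-max intensity flip in which the brightest voxel becomes the darkest and vice versa. The function name is hypothetical, and any preprocessing the authors apply before passing the volume to a CT model is not covered.

```python
import numpy as np


def invert_intensities(volume):
    """Flip voxel intensities within the volume's own [min, max] range.

    Assumed inversion rule: inverted = max + min - value.
    """
    volume = np.asarray(volume, dtype=np.float32)
    return volume.max() + volume.min() - volume
```

The operation is a single element-wise pass over the array, which is consistent with the abstract's point that no GPU is needed for this step.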
-
Device Feature based on Graph Fourier Transformation with Logarithmic Processing For Detection of Replay Speech Attacks
Authors:
Mingrui He,
Longting Xu,
Han Wang,
Mingjun Zhang,
Rohan Kumar Das
Abstract:
The most common spoofing attacks on automatic speaker verification systems are replay speech attacks. Detection of replay speech heavily relies on replay configuration information. Previous studies have shown that graph Fourier transform-derived features can effectively detect replay speech but ignore device and environmental noise effects. In this work, we propose a new feature, the graph frequency device cepstral coefficient, derived from the graph frequency domain using a device-related linear transformation. We also introduce two novel representations: graph frequency logarithmic coefficient and graph frequency logarithmic device coefficient. We evaluate our methods using traditional Gaussian mixture model and light convolutional neural network systems as classifiers. On the ASVspoof 2017 V2, ASVspoof 2019 physical access, and ASVspoof 2021 physical access datasets, our proposed features outperform known front-ends, demonstrating their effectiveness for replay speech detection.
Submitted 26 April, 2024;
originally announced April 2024.
-
An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder
Authors:
Yicheng Gu,
Xueyao Zhang,
Liumeng Xue,
Haizhou Li,
Zhizheng Wu
Abstract:
Generative Adversarial Network (GAN)-based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant Time-Frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis, making it incompatible with signals like singing voices that require dynamic attention for different frequency bands and different time intervals. Motivated by that, we propose a Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT) discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet Transform (MS-TC-CWT) discriminator. Both CQT and CWT have a dynamic TF resolution for different frequency bands. They differ in that CQT models pitch information better, while CWT models short-time transients better. Experiments conducted on both speech and singing voices confirm the effectiveness of our proposed discriminators. Moreover, the STFT, CQT, and CWT-based discriminators can be used jointly for better performance. The proposed discriminators can boost the synthesis quality of various state-of-the-art GAN-based vocoders, including HiFi-GAN, BigVGAN, and APNet.
Submitted 26 April, 2024;
originally announced April 2024.
-
Beyond Alignment: Blind Video Face Restoration via Parsing-Guided Temporal-Coherent Transformer
Authors:
Kepeng Xu,
Li Xu,
Gang He,
Wenxin Yu,
Yunsong Li
Abstract:
Multiple complex degradations are coupled in low-quality video faces in the real world. Therefore, blind video face restoration is a highly challenging ill-posed problem, requiring not only hallucinating high-fidelity details but also enhancing temporal coherence across diverse pose variations. Restoring each frame independently in a naive manner inevitably introduces temporal incoherence and artifacts from pose changes and keypoint localization errors. To address this, we propose the first blind video face restoration approach with a novel parsing-guided temporal-coherent transformer (PGTFormer) without pre-alignment. PGTFormer leverages semantic parsing guidance to select optimal face priors for generating temporally coherent artifact-free results. Specifically, we pre-train a temporal-spatial vector quantized auto-encoder on high-quality video face datasets to extract expressive context-rich priors. Then, the temporal parse-guided codebook predictor (TPCP) restores faces in different poses based on face parsing context cues without performing face pre-alignment. This strategy reduces artifacts and mitigates jitter caused by cumulative errors from face pre-alignment. Finally, the temporal fidelity regulator (TFR) enhances fidelity through temporal feature interaction and improves video temporal consistency. Extensive experiments on face videos show that our method outperforms previous face restoration baselines. The code will be released at https://github.com/kepengxu/PGTFormer.
Submitted 21 April, 2024;
originally announced April 2024.
-
On-chip Real-time Hyperspectral Imager with Full CMOS Resolution Enabled by Massively Parallel Neural Network
Authors:
Junren Wen,
Haiqi Gao,
Weiming Shi,
Shuaibo Feng,
Lingyun Hao,
Yujie Liu,
Liang Xu,
Yuchuan Shao,
Yueguang Zhang,
Weidong Shen,
Chenying Yang
Abstract:
Traditional spectral imaging methods are constrained by the time-consuming scanning process, limiting their application in dynamic scenarios. One-shot spectral imaging based on reconstruction has been a hot research topic recently, and the primary challenges still lie in both efficient fabrication techniques suitable for mass production and high-speed, high-accuracy reconstruction algorithms for real-time spectral imaging. In this study, we introduce an innovative on-chip real-time hyperspectral imager that leverages nanophotonic film spectral encoders and a Massively Parallel Network (MP-Net), featuring a 4 × 4 array of compact, all-dielectric film units for the micro-spectrometers. Each curved nanophotonic film unit uniquely modulates incident light across the underlying 3 × 3 CMOS image sensor (CIS) pixels, enabling a high spatial resolution equivalent to the full CMOS resolution. The implementation of MP-Net, specially designed to address variability in transmittance and manufacturing errors such as misalignment and non-uniformities in thin film deposition, can greatly increase the structural tolerance of the device and reduce the preparation requirement, further simplifying the manufacturing process. Tested in varied environments on both static and moving objects, the real-time hyperspectral imager demonstrates robustness and high-fidelity spatial-spectral data capabilities across diverse scenarios. This on-chip hyperspectral imager represents a significant advancement in real-time, high-resolution spectral imaging, offering a versatile solution for applications ranging from environmental monitoring and remote sensing to consumer electronics.
Submitted 15 April, 2024;
originally announced April 2024.
-
A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentation
Authors:
Yin Li,
Qi Chen,
Kai Wang,
Meige Li,
Li** Si,
Yingwei Guo,
Yu Xiong,
Qixing Wang,
Yang Qin,
Ling Xu,
Patrick van der Smagt,
Jun Tang,
Nutan Chen
Abstract:
Multi-modality magnetic resonance imaging data with various sequences facilitate the early diagnosis, tumor segmentation, and disease staging in the management of nasopharyngeal carcinoma (NPC). The lack of publicly available, comprehensive datasets limits advancements in diagnosis, treatment planning, and the development of machine learning algorithms for NPC. Addressing this critical need, we introduce the first comprehensive NPC MRI dataset, encompassing MR axial imaging of 277 primary NPC patients. This dataset includes T1-weighted, T2-weighted, and contrast-enhanced T1-weighted sequences, totaling 831 scans. In addition to the corresponding clinical data, manually annotated and labeled segmentations by experienced radiologists offer high-quality data resources from untreated primary NPC.
Submitted 4 April, 2024;
originally announced April 2024.
-
Pneumonia App: a mobile application for efficient pediatric pneumonia diagnosis using explainable convolutional neural networks (CNN)
Authors:
Jiaming Deng,
Zhenglin Chen,
Minjiang Chen,
Lulu Xu,
Jiaqi Yang,
Zhendong Luo,
Peiwu Qin
Abstract:
Mycoplasma pneumoniae pneumonia (MPP) poses significant diagnostic challenges in pediatric healthcare, especially in regions like China where it is prevalent. We introduce PneumoniaAPP, a mobile application leveraging deep learning techniques for rapid MPP detection. Our approach capitalizes on convolutional neural networks (CNNs) trained on a comprehensive dataset comprising 3345 chest X-ray (CXR) images, which includes 833 CXR images revealing MPP and is additionally augmented with samples from a public dataset. The CNN model achieved an accuracy of 88.20% and an AUROC of 0.9218 across all classes, with a specific accuracy of 97.64% for the mycoplasma class, as demonstrated on the testing dataset. Furthermore, we integrated explainability techniques into PneumoniaAPP to aid respiratory physicians in lung opacity localization. Our contribution extends beyond existing research by targeting pediatric MPP, emphasizing the age group of 0-12 years, and prioritizing deployment on mobile devices. This work represents a significant advance in pediatric pneumonia diagnosis, offering a reliable and accessible tool to alleviate diagnostic burdens in healthcare settings.
Submitted 30 March, 2024;
originally announced April 2024.
-
Neural Exponential Stabilization of Control-affine Nonlinear Systems
Authors:
Muhammad Zakwan,
Liang Xu,
Giancarlo Ferrari-Trecate
Abstract:
This paper proposes a novel learning-based approach for achieving exponential stabilization of nonlinear control-affine systems. We leverage the Control Contraction Metrics (CCMs) framework to co-synthesize Neural Contraction Metrics (NCMs) and Neural Network (NN) controllers. First, we transform the infinite-dimensional semi-definite program (SDP) for CCM computation into a tractable inequality feasibility problem using element-wise bounds of matrix-valued functions. The terms in the inequality can be efficiently computed by our novel algorithms. Second, we propose a free parametrization of NCMs guaranteeing positive definiteness and the satisfaction of a partial differential equation, regardless of trainable parameters. Third, this parametrization and the inequality condition enable the design of contractivity-enforcing regularizers, which can be incorporated while designing the NN controller for exponential stabilization of the underlying nonlinear systems. Furthermore, when the training loss goes to zero, we provide formal guarantees on verification of the NCM and the exponential stabilization under the NN controller. Finally, we validate our method through benchmark experiments on set-point stabilization and increasing the region of attraction of a locally pre-stabilized closed-loop system.
Submitted 26 March, 2024;
originally announced March 2024.
-
RadioGAT: A Joint Model-based and Data-driven Framework for Multi-band Radiomap Reconstruction via Graph Attention Networks
Authors:
Xiaojie Li,
Songyang Zhang,
Hang Li,
Xiaoyang Li,
Lexi Xu,
Haigao Xu,
Hui Mei,
Guangxu Zhu,
Nan Qi,
Ming Xiao
Abstract:
Multi-band radiomap reconstruction (MB-RMR) is a key component in wireless communications for tasks such as spectrum management and network planning. However, traditional machine-learning-based MB-RMR methods, which rely heavily on simulated data or complete structured ground truth, face significant deployment challenges. These challenges stem from the differences between simulated and actual data, as well as the scarcity of real-world measurements. To address these challenges, our study presents RadioGAT, a novel framework based on Graph Attention Network (GAT) tailored for MB-RMR within a single area, eliminating the need for multi-region datasets. RadioGAT innovatively merges model-based spatial-spectral correlation encoding with data-driven radiomap generalization, thus minimizing the reliance on extensive data sources. The framework begins by transforming sparse multi-band data into a graph structure through an innovative encoding strategy that leverages radio propagation models to capture the spatial-spectral correlation inherent in the data. This graph-based representation not only simplifies data handling but also enables tailored label sampling during training, significantly enhancing the framework's adaptability for deployment. Subsequently, the GAT is employed to generalize the radiomap information across various frequency bands. Extensive experiments using raytracing datasets based on real-world environments have demonstrated RadioGAT's enhanced accuracy in supervised learning settings and its robustness in semi-supervised scenarios. These results underscore RadioGAT's effectiveness and practicality for MB-RMR in environments with limited data availability.
Submitted 24 March, 2024;
originally announced March 2024.
-
Energy-Efficient Hybrid Beamforming with Dynamic On-off Control for Integrated Sensing, Communications, and Powering
Authors:
Zeyu Hao,
Yuan Fang,
Xianghao Yu,
Jie Xu,
Ling Qiu,
Lexi Xu,
Shuguang Cui
Abstract:
This paper investigates the energy-efficient hybrid beamforming design for a multi-functional integrated sensing, communications, and powering (ISCAP) system. In this system, a base station (BS) with a hybrid analog-digital (HAD) architecture sends unified wireless signals to communicate with multiple information receivers (IRs), sense multiple point targets, and wirelessly charge multiple energy receivers (ERs) at the same time. To facilitate the energy-efficient design, we present a novel HAD architecture for the BS transmitter, which allows dynamic on-off control of its radio frequency (RF) chains and analog phase shifters (PSs) through a switch network. We also consider a practical and comprehensive power consumption model for the BS, by taking into account the power-dependent non-linear power amplifier (PA) efficiency, and the on-off non-transmission power consumption model of RF chains and PSs. We jointly design the hybrid beamforming and dynamic on-off control at the BS, aiming to minimize its total power consumption, while guaranteeing the performance requirements on communication rates, sensing Cramér-Rao bound (CRB), and harvested power levels. The formulation also takes into consideration the per-antenna transmit power constraint and the constant modulus constraints for the analog beamformer at the BS. The resulting optimization problem for ISCAP is highly non-convex. Please refer to the paper for a complete abstract.
Submitted 24 March, 2024;
originally announced March 2024.
-
Task-Oriented Hybrid Beamforming for OFDM-DFRC Systems with Flexibly Controlled Space-Frequency Spectra
Authors:
Lingyun Xu,
Bowen Wang,
Ziyang Cheng
Abstract:
This paper investigates the issues of the hybrid beamforming design for the orthogonal frequency division multiplexing dual-function radar-communication (DFRC) system in multiple task scenarios involving the radar scanning and detection task and the target tracking task. To meet different task requirements of the DFRC system, we introduce two novel radar beampattern metrics, the average integrated sidelobe to minimum mainlobe ratio (AISMMR) and average peak sidelobe to integrated mainlobe ratio (APSIMR), to characterize the space-frequency spectra in different scenarios. Then, two HBF design problems are formulated for two task scenarios by minimizing the AISMMR and APSIMR respectively subject to the constraints of communication quality-of-service (QoS), power budget, and hardware. Due to the non-linearity and close coupling between the analog and digital beamformers in both the objective functions and QoS constraint, the resultant formulated problems are challenging to solve. Towards that end, a unified optimization algorithm based on a consensus alternating direction method of multipliers (CADMM) is proposed to solve these two problems. Moreover, under the unified CADMM framework, the closed-form solutions of primal variables in the original two problems are obtained with low complexity. Numerical simulations are provided to demonstrate the feasibility and effectiveness of the proposed algorithm.
Submitted 18 March, 2024;
originally announced March 2024.
-
Enhancing Physical Layer Security in Dual-Function Radar-Communication Systems with Hybrid Beamforming Architecture
Authors:
Lingyun Xu,
Bowen Wang,
Huiyong Li,
Ziyang Cheng
Abstract:
In this letter, we investigate enhancing the physical layer security (PLS) for the dual-function radar-communication (DFRC) system with hybrid beamforming (HBF) architecture, where the base station (BS) achieves downlink communication and radar target detection simultaneously. We consider an eavesdropper intercepting the information transmitted from the BS to the downlink communication users with imperfectly known channel state information. Additionally, the location of the radar target is also imperfectly known by the BS. To enhance PLS in the considered DFRC system, we propose a novel HBF architecture, which introduces a new integrated sensing and security (I2S) symbol. The secure HBF design problem for DFRC is formulated by maximizing the minimum legitimate user communication rate subject to radar signal-to-interference-plus-noise ratio, eavesdropping rate, hardware and power constraints. To solve this non-convex problem, we propose an alternating optimization based method to jointly optimize transmit and receive beamformers. Numerical simulation results validate the effectiveness of the proposed algorithm and show the superiority of the proposed I2S-aided HBF architecture for achieving DFRC and enhancing PLS.
Submitted 4 April, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
Structured Deep Neural Networks-Based Backstepping Trajectory Tracking Control for Lagrangian Systems
Authors:
Jiajun Qian,
Liang Xu,
Xiaoqiang Ren,
Xiaofan Wang
Abstract:
Deep neural networks (DNN) are increasingly being used to learn controllers due to their excellent approximation capabilities. However, their black-box nature poses significant challenges to closed-loop stability guarantees and performance analysis. In this paper, we introduce a structured DNN-based controller for the trajectory tracking control of Lagrangian systems using backstepping techniques. By properly designing neural network structures, the proposed controller can ensure closed-loop stability for any compatible neural network parameters. In addition, improved control performance can be achieved by further optimizing neural network parameters. Besides, we provide explicit upper bounds on tracking errors in terms of controller parameters, which allows us to achieve the desired tracking performance by properly selecting the controller parameters. Furthermore, when system models are unknown, we propose an improved Lagrangian neural network (LNN) structure to learn the system dynamics and design the controller. We show that in the presence of model approximation errors and external disturbances, the closed-loop stability and tracking control performance can still be guaranteed. The effectiveness of the proposed approach is demonstrated through simulations.
Submitted 1 March, 2024;
originally announced March 2024.
-
Emergency Caching: Coded Caching-based Reliable Map Transmission in Emergency Networks
Authors:
Zeyu Tian,
Lianming Xu,
Liang Li,
Li Wang,
Aiguo Fei
Abstract:
Many rescue missions demand effective perception and real-time decision making, which highly rely on effective data collection and processing. In this study, we propose a three-layer architecture of emergency caching networks focusing on data collection and reliable transmission, by leveraging efficient perception and edge caching technologies. Based on this architecture, we propose a disaster map collection framework that integrates coded caching technologies. Our framework strategically caches coded fragments of maps across unmanned aerial vehicles (UAVs), fostering collaborative uploading for augmented transmission reliability. Additionally, we establish a comprehensive probability model to assess the effective recovery area of disaster maps. Towards the goal of utility maximization, we propose a deep reinforcement learning (DRL) based algorithm that jointly makes decisions about cooperative UAVs selection, bandwidth allocation and coded caching parameter adjustment, accommodating the real-time map updates in a dynamic disaster situation. Our proposed scheme is more effective than the non-coding caching scheme, as validated by simulation.
Submitted 27 February, 2024;
originally announced February 2024.
-
ChatMusician: Understanding and Generating Music Intrinsically with LLM
Authors:
Ruibin Yuan,
Hanfeng Lin,
Yi Wang,
Zeyue Tian,
Shangda Wu,
Tianhao Shen,
Ge Zhang,
Yuhang Wu,
Cong Liu,
Ziya Zhou,
Ziyang Ma,
Liumeng Xue,
Ziyu Wang,
Qin Liu,
Tianyu Zheng,
Yizhi Li,
Yinghao Ma,
Yiming Liang,
Xiaowei Chi,
Ruibo Liu,
Zili Wang,
Pengfei Li,
Jingcheng Wu,
Chenghua Lin,
Qifeng Liu
, et al. (10 additional authors not shown)
Abstract:
While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 in the zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. We release our 4B token music-language corpora MusicPile, the collected MusicTheoryBench, code, model and demo on GitHub.
Submitted 25 February, 2024;
originally announced February 2024.
-
SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion
Authors:
Liumeng Xue,
Chaoren Wang,
Mingxuan Wang,
Xueyao Zhang,
Jun Han,
Zhizheng Wu
Abstract:
In this study, we present SingVisio, an interactive visual analysis system that aims to explain the diffusion model used in singing voice conversion. SingVisio provides a visual display of the generation process in diffusion models, showcasing the step-by-step denoising of the noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre. The system also facilitates side-by-side comparisons of different conditions, such as source content, melody, and target timbre, highlighting the impact of these conditions on the diffusion generation process and resulting conversions. Through comprehensive evaluations, SingVisio demonstrates its effectiveness in terms of system design, functionality, explainability, and user-friendliness. It offers users of various backgrounds valuable learning experiences and insights into the diffusion model for singing voice conversion.
Submitted 19 February, 2024;
originally announced February 2024.
-
Deep Rib Fracture Instance Segmentation and Classification from CT on the RibFrac Challenge
Authors:
Jiancheng Yang,
Rui Shi,
Liang Jin,
Xiaoyang Huang,
Kaiming Kuang,
Donglai Wei,
Shixuan Gu,
Jianying Liu,
Pengfei Liu,
Zhizhong Chai,
Yongjie Xiao,
Hao Chen,
Liming Xu,
Bang Du,
Xiangyi Yan,
Hao Tang,
Adam Alessio,
Gregory Holste,
Jiapeng Zhang,
Xiaoming Wang,
Jianye He,
Lixuan Che,
Hanspeter Pfister,
Ming Li,
Bingbing Ni
Abstract:
Rib fractures are a common and potentially severe injury that can be challenging and labor-intensive to detect in CT scans. While there have been efforts to address this field, the lack of large-scale annotated datasets and evaluation benchmarks has hindered the development and validation of deep learning algorithms. To address this issue, the RibFrac Challenge was introduced, providing a benchmark dataset of over 5,000 rib fractures from 660 CT scans, with voxel-level instance mask annotations and diagnosis labels for four clinical categories (buckle, nondisplaced, displaced, or segmental). The challenge includes two tracks: a detection (instance segmentation) track evaluated by an FROC-style metric and a classification track evaluated by an F1-style metric. During the MICCAI 2020 challenge period, 243 results were evaluated, and seven teams were invited to participate in the challenge summary. The analysis revealed that several top rib fracture detection solutions achieved performance comparable to or even better than human experts. Nevertheless, the current rib fracture classification solutions are hardly clinically applicable, which makes this an interesting area for future work. As an active benchmark and research resource, the data and online evaluation of the RibFrac Challenge are available at the challenge website. As an independent contribution, we have also extended our previous internal baseline by incorporating recent advancements in large-scale pretrained networks and point-based rib segmentation techniques. The resulting FracNet+ demonstrates competitive performance in rib fracture detection, which lays a foundation for further research and development in AI-assisted rib fracture detection and diagnosis.
Submitted 14 February, 2024;
originally announced February 2024.
-
Revisiting Generative Adversarial Networks for Binary Semantic Segmentation on Imbalanced Datasets
Authors:
Lei Xu,
Moncef Gabbouj
Abstract:
Anomalous crack region detection is a typical binary semantic segmentation task, which aims to automatically detect pixels representing cracks in pavement surface images. Although existing deep learning-based methods have achieved outstanding results on specific public pavement datasets, their performance deteriorates dramatically on imbalanced datasets. The input datasets used in such tasks suffer from severe between-class imbalance; hence, it is a core challenge to obtain robust performance on diverse pavement datasets with generic deep learning models. To address this problem, in this work, we propose a deep learning framework based on conditional Generative Adversarial Networks (cGANs) for anomalous crack region detection at the pixel level. In particular, the proposed framework, containing a cGAN and a novel auxiliary network, is developed to enhance and stabilize the generator's performance under two alternative training stages, when iteratively estimating a multiscale probability feature map from heterogeneous and imbalanced inputs. Moreover, several attention mechanisms and entropy strategies are incorporated into the cGAN architecture and the auxiliary network separately to further mitigate the performance deterioration of model training on severely imbalanced datasets. We implement extensive experiments on six accessible pavement datasets. The experimental results from both visual and quantitative evaluation show that the proposed framework can achieve state-of-the-art results on these datasets efficiently and robustly without increasing computational complexity.
Submitted 7 March, 2024; v1 submitted 3 February, 2024;
originally announced February 2024.
-
Emergency Computing: An Adaptive Collaborative Inference Method Based on Hierarchical Reinforcement Learning
Authors:
Weiqi Fu,
Lianming Xu,
Xin Wu,
Li Wang,
Aiguo Fei
Abstract:
In achieving effective emergency response, the timely acquisition of environmental information, seamless command data transmission, and prompt decision-making are crucial. This necessitates the establishment of a resilient emergency communication dedicated network, capable of providing communication and sensing services even in the absence of basic infrastructure. In this paper, we propose an Emergency Network with Sensing, Communication, Computation, Caching, and Intelligence (E-SC3I). The framework incorporates mechanisms for emergency computing, caching, integrated communication and sensing, and intelligence empowerment. E-SC3I ensures rapid access to a large user base, reliable data transmission over unstable links, and dynamic network deployment in a changing environment. However, these advantages come at the cost of significant computation overhead. Therefore, we specifically concentrate on emergency computing and propose an adaptive collaborative inference method (ACIM) based on hierarchical reinforcement learning. Experimental results demonstrate our method's ability to achieve rapid inference of AI models with constrained computational and communication resources.
Submitted 3 February, 2024;
originally announced February 2024.
-
An Intra-BRNN and GB-RVQ Based END-TO-END Neural Audio Codec
Authors:
Linping Xu,
Jiawei Jiang,
Dejun Zhang,
Xianjun Xia,
Li Chen,
Yijian Xiao,
Piao Ding,
Shenyi Song,
Sixing Yin,
Ferdous Sohel
Abstract:
Recently, neural networks have proven to be effective in performing speech coding tasks at low bitrates. However, the under-utilization of intra-frame correlations and quantizer error degrade the reconstructed audio quality. To improve the coding quality, we present an end-to-end neural speech codec, namely CBRC (Convolutional and Bidirectional Recurrent neural Codec). An interleaved structure using 1D-CNN and Intra-BRNN is designed to exploit the intra-frame correlations more efficiently. Furthermore, a Group-wise and Beam-search Residual Vector Quantizer (GB-RVQ) is used to reduce the quantization noise. CBRC encodes audio every 20 ms with no additional latency, which is suitable for real-time communication. Experimental results demonstrate the superiority of the proposed codec when comparing CBRC at 3 kbps with Opus at 12 kbps.
Submitted 2 February, 2024;
originally announced February 2024.
-
Dual-Tap Optical-Digital Feedforward Equalization Enabling High-Speed Optical Transmission in IM/DD Systems
Authors:
Yu Guo,
Yangbo Wu,
Zhao Yang,
Lei Xue,
Ning Liang,
Yang Ren,
Zhengrui Tu,
Jia Feng,
Qunbi Zhuge
Abstract:
Intensity-modulation and direct-detection (IM/DD) transmission is widely adopted for high-speed optical transmission scenarios due to its cost-effectiveness and simplicity. However, as the data rate increases, the fiber chromatic dispersion (CD) would induce a serious power fading effect, and direct detection could generate inter-symbol interference (ISI). Moreover, the ISI becomes more severe with the increase of fiber length, thereby highly restricting the transmission distance of IM/DD systems. This paper proposes a dual-tap optical-digital feedforward equalization (DT-ODFE) scheme, which could effectively compensate for CD-induced power fading while maintaining low cost and simplicity. A theoretical channel response is formulated for IM/DD transmission, incorporating a dual-tap optical equalizer, and the theoretical analysis reveals that for an IM/DD transmission using 1371 nm over 10 km standard single-mode fiber (SSMF), frequency notch is removed from 33.7 GHz to 46 GHz. Simulation results show that the DT-ODFE achieves an SNR gain of 2.3 dB over IM/DD systems with symbol-space feedforward equalizer (FFE) alone. As the fiber length increases to 15 km, DT-ODFE performs well, while FFE, decision-feedback equalizer (DFE) and Volterra nonlinear equalizers (VNLE) all fail to compensate for the power fading and the 7% hard-decision FEC limit is not satisfied. For 200 Gb/s/$λ$ PAM-4 over 15 km SSMF, results show that the signal-to-noise ratio (SNR) of the proposed DT-ODFE with optimal coefficients satisfies the 7% hard-decision FEC limit, which uncovers the great potential of the DT-ODFE for high-speed IM/DD systems in LR/FR scenarios.
Submitted 1 February, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Data and Physics driven Deep Learning Models for Fast MRI Reconstruction: Fundamentals and Methodologies
Authors:
Jiahao Huang,
Yinzhe Wu,
Fanwen Wang,
Yingying Fang,
Yang Nan,
Cagan Alkan,
Lei Xu,
Zhifan Gao,
Weiwen Wu,
Lei Zhu,
Zhaolin Chen,
Peter Lally,
Neal Bangerter,
Kawin Setsompop,
Yike Guo,
Daniel Rueckert,
Ge Wang,
Guang Yang
Abstract:
Magnetic Resonance Imaging (MRI) is a pivotal clinical diagnostic tool, yet its extended scanning times often compromise patient comfort and image quality, especially in volumetric, temporal and quantitative scans. This review elucidates recent advances in MRI acceleration via data and physics-driven models, leveraging techniques from algorithm unrolling models, enhancement-based models, and plug-and-play models to the emerging full spectrum of generative models. We also explore the synergistic integration of data models with physics-based insights, encompassing the advancements in multi-coil hardware accelerations like parallel imaging and simultaneous multi-slice imaging, and the optimization of sampling patterns. We then focus on domain-specific challenges and opportunities, including image redundancy exploitation, image integrity, evaluation metrics, data heterogeneity, and model generalization. This work also discusses potential solutions and future research directions, emphasizing the role of data harmonization and federated learning for further improving the general applicability and performance of these methods in MRI reconstruction.
Submitted 29 January, 2024;
originally announced January 2024.
-
Towards Autonomous Supply Chains: Definition, Characteristics, Conceptual Framework, and Autonomy Levels
Authors:
Liming Xu,
Stephen Mak,
Yaniv Proselkov,
Alexandra Brintrup
Abstract:
Recent global disruptions, such as the pandemic and geopolitical conflicts, have profoundly exposed vulnerabilities in traditional supply chains, requiring exploration of more resilient alternatives. Autonomous supply chains (ASCs) have emerged as a potential solution, offering increased visibility, flexibility, and resilience in turbulent trade environments. Despite discussions in industry and academia over several years, ASCs lack well-established theoretical foundations. This paper addresses this research gap by presenting a formal definition of ASC along with its defining characteristics and auxiliary concepts. We propose a layered conceptual framework called the MIISI model. An illustrative case study focusing on the meat supply chain demonstrates an initial ASC implementation based on this conceptual model. Additionally, we introduce a seven-level supply chain autonomy reference model, delineating a trajectory towards achieving full supply chain autonomy. Recognising that this work represents an initial endeavour, we emphasise the need for continued exploration in this emerging domain. We anticipate that this work will stimulate further research, both theoretical and technical, and contribute to the continual evolution of ASCs.
Submitted 13 October, 2023;
originally announced January 2024.
-
Force sensing to reconstruct potential energy landscapes for cluttered large obstacle traversal
Authors:
Yaqing Wang,
Ling Xu,
Chen Li
Abstract:
Visual sensing of environmental geometry allows robots to use artificial potential fields to avoid sparse obstacles. Yet robots must further traverse cluttered large obstacles for applications like search and rescue through rubble and planetary exploration across Martian rocks. Recent studies discovered that to traverse cluttered large obstacles, multi-legged insects and insect-inspired robots make strenuous transitions across locomotor modes with major changes in body orientation. When viewed on a potential energy landscape resulting from locomotor-obstacle physical interaction, these are barrier-crossing transitions across landscape basins. This potential energy landscape approach may provide a modeling framework for cluttered large obstacle traversal. Here, we take the next step toward this vision by testing whether force sensing allows the reconstruction of the potential energy landscape. We developed a cockroach-inspired, minimalistic robot capable of sensing obstacle contact forces and torques around its body as it propelled forward against a pair of cluttered grass-like beam obstacles. We performed measurements over many traverses with systematically varied body orientations. Despite the forces and torques not being fully conservative, they matched the potential energy landscape gradients well, and the landscape reconstructed from them matched the ground truth well. In addition, inspired by cockroach observations, we found that robot head oscillation during traversal further improved the accuracies of force sensing and landscape reconstruction. We still need to study how to reconstruct the landscape during a single traverse, since in applications robots have little chance to use multiple traverses to sample the environment systematically, and how to find landscape saddles for least-effort transitions.
Submitted 23 January, 2024;
originally announced January 2024.
-
Susceptibility of Adversarial Attack on Medical Image Segmentation Models
Authors:
Zhongxuan Wang,
Leo Xu
Abstract:
The nature of deep neural networks has given rise to a variety of attacks, but little work has been done to address the effect of adversarial attacks on segmentation models trained on MRI datasets. In light of the grave consequences that such attacks could cause, we explore four models from the U-Net family and examine their responses to the Fast Gradient Sign Method (FGSM) attack. We conduct FGSM attacks on each of them and experiment with various schemes to conduct the attacks. In this paper, we find that medical imaging segmentation models are indeed vulnerable to adversarial attacks and that there is a negligible correlation between parameter size and adversarial attack success. Furthermore, we show that using a different loss function than the one used for training yields higher adversarial attack success, contrary to what the FGSM authors suggested. In future efforts, we will conduct the experiments detailed in this paper with more segmentation models and different attacks. We will also attempt to find ways to counteract the attacks by using model ensembles or special data augmentations. Our code is available at https://github.com/ZhongxuanWang/adv_attk
Submitted 20 January, 2024;
originally announced January 2024.
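The FGSM attack named in the abstract above perturbs an input by a small step in the direction of the sign of the loss gradient. The following is a minimal sketch of that single step, assuming a toy logistic-regression model with binary cross-entropy loss rather than the U-Net segmentation models the paper studies; the weights `w`, bias `b`, input `x`, and step size `eps` are made-up illustration values, not taken from the paper or its repository.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(x, y, w, b):
    """Binary cross-entropy of a logistic model for one sample."""
    p = sigmoid(x @ w + b)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def fgsm(x, y, w, b, eps):
    """One FGSM step: x_adv = x + eps * sign(dL/dx).

    For the logistic model with BCE loss, dL/dx = (p - y) * w.
    """
    p = sigmoid(x @ w + b)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

# Hypothetical toy model and input (assumptions for illustration only).
w = np.array([1.5, -2.0, 0.5])
b = 0.1
x = np.array([0.2, -0.4, 1.0])
y = 1.0
eps = 0.1

x_adv = fgsm(x, y, w, b, eps)
loss_clean = bce_loss(x, y, w, b)
loss_adv = bce_loss(x_adv, y, w, b)
print(loss_clean, loss_adv)
```

The perturbation is bounded by `eps` in every coordinate, yet it raises the loss on the true label. The abstract's finding about using a different loss than the training loss would correspond to swapping the gradient computed inside `fgsm`.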
-
Hazard resistance-based spatiotemporal risk analysis for distribution network outages during hurricanes
Authors:
Luo Xu,
Ning Lin,
Dazhi Xi,
Kairui Feng,
H. Vincent Poor
Abstract:
Blackouts in recent decades show an increasing prevalence of power outages due to extreme weather events such as hurricanes. Precisely assessing the spatiotemporal outages in distribution networks, the most vulnerable part of power systems, is critical to enhance power system resilience. The Sequential Monte Carlo (SMC) simulation method is widely used for spatiotemporal risk analysis of power systems during extreme weather hazards. However, it is found here that the SMC method can lead to large errors by directly applying the fragility function or failure probability of system components in time-sequential analysis, particularly overestimating damages under evolving hazards with high-frequency sampling. To address this issue, a novel hazard resistance-based spatiotemporal risk analysis (HRSRA) method is proposed. This method converts the time-varying failure probability of a component into a hazard resistance as a time-invariant value during the simulation of evolving hazards. The proposed HRSRA provides an adaptive framework for incorporating high-spatiotemporal-resolution meteorology models into power outage simulations. By leveraging the geographic information system data of the power system and a physics-based hurricane wind field model, the superiority of the proposed method is validated using real-world time-series power outage data from Puerto Rico during Hurricane Fiona 2022.
Submitted 18 January, 2024;
originally announced January 2024.
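The overestimation described in the abstract above can be illustrated numerically. The sketch below is one possible reading, not the paper's formulation: it assumes the "naive" approach treats the failure probability at each time step as an independent trial, and that the hazard-resistance approach draws one uniform threshold per component, which then fails once the time-varying failure probability reaches that threshold. The fragility profile `hazard_profile` is a made-up curve for illustration.

```python
import numpy as np

def cumulative_failure_naive(p_t):
    """Re-apply p_t as an independent failure trial at every time step."""
    return 1.0 - float(np.prod(1.0 - np.asarray(p_t)))

def cumulative_failure_resistance(p_t):
    """Assumed resistance model: draw u ~ U(0, 1) once per component and
    fail when p_t >= u, so P(fail by the end) = max over t of p_t."""
    return float(np.max(p_t))

def hazard_profile(n_steps):
    """Hypothetical failure-probability curve over one storm passage,
    sampled at n_steps points and peaking near 0.3."""
    t = np.linspace(0.0, 1.0, n_steps)
    return 0.3 * np.sin(np.pi * t) ** 2

naive = {}
resistance = {}
for n in (6, 60, 600):
    p = hazard_profile(n)
    naive[n] = cumulative_failure_naive(p)
    resistance[n] = cumulative_failure_resistance(p)
    print(n, naive[n], resistance[n])
```

Under these assumptions, the naive estimate climbs toward 1 as the sampling frequency increases, while the threshold-based estimate stays bounded by the peak of the curve regardless of how finely the hazard is sampled.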
-
Communication-Efficient Personalized Federated Learning for Speech-to-Text Tasks
Authors:
Yichao Du,
Zhirui Zhang,
Linan Yue,
Xu Huang,
Yuqing Zhang,
Tong Xu,
Linli Xu,
Enhong Chen
Abstract:
To protect privacy and meet legal regulations, federated learning (FL) has gained significant attention for training speech-to-text (S2T) systems, including automatic speech recognition (ASR) and speech translation (ST). However, the commonly used FL approach (i.e., \textsc{FedAvg}) in S2T tasks typically suffers from extensive communication overhead due to multi-round interactions based on the whole model and performance degradation caused by data heterogeneity among clients. To address these issues, we propose a personalized federated S2T framework that introduces \textsc{FedLoRA}, a lightweight LoRA module for client-side tuning and interaction with the server to minimize communication overhead, and \textsc{FedMem}, a global model equipped with a $k$-nearest-neighbor ($k$NN) classifier that captures client-specific distributional shifts to achieve personalization and overcome data heterogeneity. Extensive experiments based on Conformer and Whisper backbone models on CoVoST and GigaSpeech benchmarks show that our approach significantly reduces the communication overhead on all S2T tasks and effectively personalizes the global model to overcome data heterogeneity.
Submitted 18 January, 2024;
originally announced January 2024.
-
An Improved Virtual Force Approach for UAV Deployment and Resource Allocation in Emergency Communications
Authors:
Hongying Guo,
Li Wang,
Ruoguang Li,
Luyang Hou,
Lianming Xu,
Aiguo Fei
Abstract:
In this paper, we consider an unmanned aerial vehicle (UAV)-enabled emergency communication system, which establishes temporary communication links with user equipment (UEs) in a typical disaster environment with mountainous forest and obstacles. Towards this end, a joint deployment, power allocation, and user association optimization problem is formulated to maximize the total transmission rate, while considering the demand of each UE and the disaster environment characteristics. Then, an alternating optimization algorithm is proposed by integrating a coalition game and a virtual force approach, which captures the impact of the demand priority of UEs and of the obstacles on the flight path and consumed power. Simulation results demonstrate that the computation time consumed by our proposed algorithm is only $5.6\%$ of that of traditional heuristic algorithms, which validates its effectiveness in disaster scenarios.
Submitted 17 January, 2024;
originally announced January 2024.
-
UAV-assisted Emergency Integrated Sensing and Communication Networks: A CNN-based Rapid Deployment Approach
Authors:
Zao Wang,
Lianming Xu,
Luyang Hou,
Ruoguang Li,
Li Wang
Abstract:
UAV-assisted integrated sensing and communication (ISAC) networks are crucial for post-disaster emergency rescue. The speed of UAV deployment directly impacts rescue results. However, the ISAC UAV deployment problem in emergency scenarios is difficult to solve, which conflicts with the need for rapid deployment. In this paper, we propose a two-stage deployment framework to achieve rapid ISAC UAV deployment in emergency scenarios, which consists of an offline stage and an online stage. Specifically, in the offline stage, we first formulate the ISAC UAV deployment problem and define the ISAC utility as the objective function, which integrates communication rate and localization accuracy. Secondly, we develop a dynamic particle swarm optimization (DPSO) algorithm to construct an optimized UAV deployment dataset. Finally, we train a convolutional neural network (CNN) model with this dataset, which replaces the time-consuming DPSO algorithm. In the online stage, the trained CNN model can be used to make quick decisions for the ISAC UAV deployment. The simulation results indicate that the trained CNN model achieves superior ISAC performance compared to the classic particle swarm optimization algorithm. Additionally, it reduces the deployment time by more than 96%.
Submitted 13 January, 2024;
originally announced January 2024.
-
Transfer the linguistic representations from TTS to accent conversion with non-parallel data
Authors:
Xi Chen,
Jiakun Pei,
Liumeng Xue,
Mingyang Zhang
Abstract:
Accent conversion aims to convert the accent of a source speech to a target accent, meanwhile preserving the speaker's identity. This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent in the source speech. Specifically, the proposed system aligns speech representations with linguistic representations obtained from Text-to-Speech (TTS) systems, enabling training of the accent voice conversion model on non-parallel data. Furthermore, we investigate the effectiveness of a pretraining strategy on native data and different acoustic features within our proposed framework. We conduct a comprehensive evaluation using both subjective and objective metrics to assess the performance of our approach. The evaluation results highlight the benefits of the pretraining strategy and the incorporation of richer semantic features, resulting in significantly enhanced audio quality and intelligibility.
Submitted 7 January, 2024;
originally announced January 2024.
-
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit
Authors:
Xueyao Zhang,
Liumeng Xue,
Yicheng Gu,
Yuancheng Wang,
Haorui He,
Chaoren Wang,
Xi Chen,
Zihao Fang,
Haopeng Chen,
Junan Zhang,
Tze Ying Tang,
Lexiao Zou,
Mingxuan Wang,
Jun Han,
Kai Chen,
Haizhou Li,
Zhizheng Wu
Abstract:
Amphion is an open-source toolkit for Audio, Music, and Speech Generation that aims to ease the way for junior researchers and engineers into these fields. It presents a unified framework that is inclusive of diverse generation tasks and models, with the added bonus of being easily extendable to incorporate new ones. The toolkit is designed with beginner-friendly workflows and pre-trained models, allowing both beginners and seasoned researchers to kick-start their projects with relative ease. Additionally, it provides interactive visualizations and demonstrations of classic models for educational purposes. The initial release of Amphion v0.1 supports a range of tasks including Text to Speech (TTS), Text to Audio (TTA), and Singing Voice Conversion (SVC), supplemented by essential components like data preprocessing, state-of-the-art vocoders, and evaluation metrics. This paper presents a high-level overview of Amphion.
Submitted 22 February, 2024; v1 submitted 15 December, 2023;
originally announced December 2023.
-
Learning with Noisy Low-Cost MOS for Image Quality Assessment via Dual-Bias Calibration
Authors:
Lei Wang,
Qingbo Wu,
Desen Yuan,
King Ngi Ngan,
Hongliang Li,
Fanman Meng,
Linfeng Xu
Abstract:
Learning based image quality assessment (IQA) models have obtained impressive performance with the help of reliable subjective quality labels, where mean opinion score (MOS) is the most popular choice. However, in view of the subjective bias of individual annotators, the labor-abundant MOS (LA-MOS) typically requires a large collection of opinion scores from multiple annotators for each image, which significantly increases the learning cost. In this paper, we aim to learn robust IQA models from low-cost MOS (LC-MOS), which only requires very few opinion scores or even a single opinion score for each image. More specifically, we consider the LC-MOS as the noisy observation of LA-MOS and enforce the IQA model learned from LC-MOS to approach the unbiased estimation of LA-MOS. In this way, we represent the subjective bias between LC-MOS and LA-MOS, and the model bias between IQA predictions learned from LC-MOS and LA-MOS (i.e., dual-bias) as two latent variables with unknown parameters. By means of the expectation-maximization based alternating optimization, we can jointly estimate the parameters of the dual-bias, which suppresses the misleading of LC-MOS via a gated dual-bias calibration (GDBC) module. To the best of our knowledge, this is the first exploration of robust IQA model learning from noisy low-cost labels. Theoretical analysis and extensive experiments on four popular IQA datasets show that the proposed method is robust toward different bias rates and annotation numbers and significantly outperforms the other learning based IQA models when only LC-MOS is available. Furthermore, we also achieve comparable performance with respect to the other models learned with LA-MOS.
Submitted 27 November, 2023;
originally announced November 2023.
-
Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder
Authors:
Yicheng Gu,
Xueyao Zhang,
Liumeng Xue,
Zhizheng Wu
Abstract:
Generative Adversarial Network (GAN) based vocoders are superior in inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to promote GAN-based vocoders. Most existing time-frequency-representation-based discriminators are rooted in Short-Time Fourier Transform (STFT), whose time-frequency resolution in a spectrogram is fixed, making it incompatible with signals like singing voices that require flexible attention for different frequency bands. Motivated by that, our study utilizes the Constant-Q Transform (CQT), which owns dynamic resolution among frequencies, contributing to a better modeling ability in pitch accuracy and harmonic tracking. Specifically, we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates on the CQT spectrogram at multiple scales and performs sub-band processing according to different octaves. Experiments conducted on both speech and singing voices confirm the effectiveness of our proposed method. Moreover, we also verified that the CQT-based and the STFT-based discriminators could be complementary under joint training. Specifically, enhanced by the proposed MS-SB-CQT and the existing MS-STFT Discriminators, the MOS of HiFi-GAN can be boosted from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen singers.
Submitted 25 November, 2023;
originally announced November 2023.
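The frequency-dependent resolution of the CQT and the per-octave sub-band split mentioned in the abstract above can be sketched as follows. This is an illustration of the standard CQT bin layout only, not the MS-SB-CQT discriminator itself; the parameters (`f_min = 32.7` Hz, 8 octaves, 24 bins per octave, 100 frames) and the random array standing in for a CQT spectrogram are assumptions chosen for the example.

```python
import numpy as np

def cqt_center_frequencies(f_min, n_octaves, bins_per_octave):
    """Geometrically spaced centre frequencies: f_k = f_min * 2^(k / B)."""
    k = np.arange(n_octaves * bins_per_octave)
    return f_min * 2.0 ** (k / bins_per_octave)

def cqt_bandwidths(freqs, bins_per_octave):
    """Bandwidth grows with frequency so that Q = f_k / bandwidth_k is constant."""
    return freqs * (2.0 ** (1.0 / bins_per_octave) - 1.0)

def split_into_octave_subbands(cqt_spec, bins_per_octave):
    """Split a (n_bins, n_frames) CQT spectrogram into one sub-band per octave."""
    n_bins = cqt_spec.shape[0]
    return [cqt_spec[i:i + bins_per_octave]
            for i in range(0, n_bins, bins_per_octave)]

# Hypothetical settings for illustration.
f_min, n_octaves, bins_per_octave = 32.7, 8, 24
freqs = cqt_center_frequencies(f_min, n_octaves, bins_per_octave)
bandwidths = cqt_bandwidths(freqs, bins_per_octave)

# Stand-in for a real CQT magnitude spectrogram with 100 frames.
rng = np.random.default_rng(0)
cqt_spec = rng.random((n_octaves * bins_per_octave, 100))
subbands = split_into_octave_subbands(cqt_spec, bins_per_octave)
print(len(subbands), subbands[0].shape)
```

Low bins are narrow and high bins are wide, unlike the uniform bin spacing of an STFT. Each octave sub-band has the same number of bins, so it can be handed to its own processing branch.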
-
Deep Neural Network Identification of Limnonectes Species and New Class Detection Using Image Data
Authors:
Li Xu,
Yili Hong,
Eric P. Smith,
David S. McLeod,
Xinwei Deng,
Laura J. Freeman
Abstract:
As is true of many complex tasks, the work of discovering, describing, and understanding the diversity of life on Earth (viz., biological systematics and taxonomy) requires many tools. Some of this work can be accomplished as it has been done in the past, but some aspects present us with challenges which traditional knowledge and tools cannot adequately resolve. One such challenge is presented by species complexes in which the morphological similarities among the group members make it difficult to reliably identify known species and detect new ones. We address this challenge by developing new tools using the principles of machine learning to resolve two specific questions related to species complexes. The first question is formulated as a classification problem in statistics and machine learning and the second question is an out-of-distribution (OOD) detection problem. We apply these tools to a species complex comprising Southeast Asian stream frogs (Limnonectes kuhlii complex) and employ a morphological character (hind limb skin texture) traditionally treated qualitatively in a quantitative and objective manner. We demonstrate that deep neural networks can successfully automate the classification of an image into a known species group for which it has been trained. We further demonstrate that the algorithm can successfully classify an image into a new class if the image does not belong to the existing classes. Additionally, we use the larger MNIST dataset to test the performance of our OOD detection algorithm. We finish our paper with some concluding remarks regarding the application of these methods to species complexes and our efforts to document true biodiversity. This paper has online supplementary materials.
Submitted 14 November, 2023;
originally announced November 2023.
-
Detection of Small Targets in Sea Clutter Based on RepVGG and Continuous Wavelet Transform
Authors:
Jingchen Ni,
Haoru Li,
Lilin Xu,
Jing Liang
Abstract:
Constructing a high-performance target detector under the background of sea clutter is always necessary and important. In this work, we propose a RepVGGA0-CWT detector, where RepVGG is a residual network that gains a high detection accuracy. Different from traditional residual networks, RepVGG keeps an acceptable calculation speed. Giving consideration to both accuracy and speed, the RepVGGA0 is selected among all the variants of RepVGG. Also, continuous wavelet transform (CWT) is employed to extract the radar echoes' time-frequency feature effectively. In the tests, other networks (ResNet50, ResNet18 and AlexNet) and feature extraction methods (short-time Fourier transform (STFT), CWT) are combined to build detectors for comparison. The result of different datasets shows that the RepVGGA0-CWT detector performs better than those detectors in terms of low controllable false alarm rate, high training speed, high inference speed and low memory usage. This RepVGGA0-CWT detector is hardware-friendly and can be applied in real-time scenes for its high inference speed in detection.
Submitted 14 November, 2023;
originally announced November 2023.
-
SponTTS: modeling and transferring spontaneous style for TTS
Authors:
Hanzhao Li,
Xinfa Zhu,
Liumeng Xue,
Yang Song,
Yunlin Chen,
Lei Xie
Abstract:
Spontaneous speaking style exhibits notable differences from other speaking styles due to various spontaneous phenomena (e.g., filled pauses, prolongation) and substantial prosody variation (e.g., diverse pitch and duration variation, occasional non-verbal speech like a smile), posing challenges to modeling and prediction of spontaneous style. Moreover, the limitation of high-quality spontaneous data constrains spontaneous speech generation for speakers without spontaneous data. To address these problems, we propose SponTTS, a two-stage approach based on neural bottleneck (BN) features to model and transfer spontaneous style for TTS. In the first stage, we adopt a Conditional Variational Autoencoder (CVAE) to capture spontaneous prosody from a BN feature and involve the spontaneous phenomena by the constraint of spontaneous phenomena embedding prediction loss. Besides, we introduce a flow-based predictor to predict a latent spontaneous style representation from the text, which enriches the prosody and context-specific spontaneous phenomena during inference. In the second stage, we adopt a VITS-like module to transfer the spontaneous style learned in the first stage to the target speakers. Experiments demonstrate that SponTTS is effective in modeling spontaneous style and transferring the style to the target speakers, generating spontaneous speech with high naturalness, expressiveness, and speaker similarity. The zero-shot spontaneous style TTS test further verifies the generalization and robustness of SponTTS in generating spontaneous speech for unseen speakers.
Submitted 8 January, 2024; v1 submitted 13 November, 2023;
originally announced November 2023.
-
Optimization of RIS Placement for Satellite-to-Ground Coverage Enhancement
Authors:
Xingchen Liu,
Liuxun Xue,
Shu Sun,
Meixia Tao
Abstract:
In satellite-to-ground communication, ensuring reliable and efficient connectivity poses significant challenges. The reconfigurable intelligent surface (RIS) offers a promising solution due to its ability to manipulate wireless propagation environments and thus enhance communication performance. In this paper, we propose a method for optimizing the placement of RISs on building facets to improve satellite-to-ground communication coverage. We model satellite-to-ground communication with RIS assistance, considering the actual positions of buildings and ground users. The theoretical lower bound on the coverage enhancement in satellite-to-ground communication through large-scale RIS deployment is derived. Then a novel optimization framework for RIS placement is formulated, and a parallel genetic algorithm is employed to solve the problem. Simulation results demonstrate the superior performance of the proposed RIS deployment strategy in enhancing satellite communication coverage probability for non-line-of-sight users. The proposed framework can be applied to various architectural distributions, such as rural areas, towns, and cities, by adjusting parameter settings.
Submitted 6 November, 2023;
originally announced November 2023.
-
Multi-Agent Consensus Seeking via Large Language Models
Authors:
Huaben Chen,
Wenkang Ji,
Lufeng Xu,
Shiyu Zhao
Abstract:
Multi-agent systems driven by large language models (LLMs) have shown promising abilities for solving complex tasks in a collaborative manner. This work considers a fundamental problem in multi-agent collaboration: consensus seeking. When multiple agents work together, we are interested in how they can reach a consensus through inter-agent negotiation. To that end, this work studies a consensus-seeking task where the state of each agent is a numerical value and they negotiate with each other to reach a consensus value. It is revealed that when not explicitly directed on which strategy should be adopted, the LLM-driven agents primarily use the average strategy for consensus seeking although they may occasionally use some other strategies. Moreover, this work analyzes the impact of the agent number, agent personality, and network topology on the negotiation process. The findings reported in this work can potentially lay the foundations for understanding the behaviors of LLM-driven multi-agent systems for solving more complex tasks. Furthermore, LLM-driven consensus seeking is applied to a multi-robot aggregation task. This application demonstrates the potential of LLM-driven agents to achieve zero-shot autonomous planning for multi-robot collaboration tasks. Project website: westlakeintelligentrobotics.github.io/ConsensusLLM/.
Submitted 30 October, 2023;
originally announced October 2023.
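The "average strategy" that the abstract above reports the agents as favouring can be sketched as a classical averaging consensus update, with no LLM involved. This is an illustration under assumptions: each agent is taken to move to the mean of its own state and its neighbours' states, and the four-agent ring topology and initial values are made up for the example rather than taken from the paper.

```python
import numpy as np

def consensus_step(states, adjacency):
    """Each agent moves to the mean of its own state and its neighbours' states."""
    a = adjacency + np.eye(len(states))  # include the agent itself
    return (a @ states) / a.sum(axis=1)

# Hypothetical setup: four agents on a ring, each holding one numerical state.
initial_states = np.array([10.0, 40.0, 70.0, 100.0])
ring = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

states = initial_states.copy()
for _ in range(50):
    states = consensus_step(states, ring)

print(states)
```

Because every agent in this ring has the same number of neighbours, the update preserves the mean of the states, so the agents settle on the average of their initial values. Changing `ring` to another adjacency matrix is one way to explore the effect of network topology that the abstract mentions.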
-
Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion
Authors:
Xueyao Zhang,
Yicheng Gu,
Haopeng Chen,
Zihao Fang,
Lexiao Zou,
Junan Zhang,
Liumeng Xue,
Jinchao Zhang,
Jie Zhou,
Zhizheng Wu
Abstract:
Singing Voice Conversion (SVC) is a technique that enables any singer to perform any song. To achieve this, it is essential to obtain speaker-agnostic representations from the source audio, which poses a significant challenge. A common solution involves utilizing a semantic-based audio pretrained model as a feature extractor. However, the degree to which the extracted features can meet the SVC requirements remains an open question. This includes their capability to accurately model melody and lyrics, the speaker-independency of their underlying acoustic information, and their robustness for in-the-wild acoustic environments. In this study, we investigate the knowledge within classical semantic-based pretrained models in much detail. We discover that the knowledge of different models is diverse and can be complementary for SVC. To jointly utilize the diverse pretrained models with mismatched time resolutions, we propose an efficient ReTrans strategy to address the feature fusion problem. Based on the above, we design a Singing Voice Conversion framework based on Diverse Semantic-based Feature Fusion (DSFF-SVC). Experimental results demonstrate that DSFF-SVC can be generalized and improve various existing SVC models, particularly in challenging real-world conversion tasks.
Submitted 27 May, 2024; v1 submitted 17 October, 2023;
originally announced October 2023.
-
Automatic nodule identification and differentiation in ultrasound videos to facilitate per-nodule examination
Authors:
Siyuan Jiang,
Yan Ding,
Yuling Wang,
Lei Xu,
Wenli Dai,
Wanru Chang,
Jianfeng Zhang,
Jie Yu,
Jianqiao Zhou,
Chunquan Zhang,
Ping Liang,
Dexing Kong
Abstract:
Ultrasound is a vital diagnostic technique in health screening. It has the advantages of being non-invasive, cost-effective, and radiation-free, and is therefore widely applied in the diagnosis of nodules. However, it relies heavily on the expertise and clinical experience of the sonographer. In ultrasound images, a single nodule might present heterogeneous appearances in different cross-sectional views, which makes it hard to perform per-nodule examination. Sonographers usually discriminate different nodules by examining the nodule features and the surrounding structures like gland and duct, which is cumbersome and time-consuming. To address this problem, we collected hundreds of breast ultrasound videos and built a nodule re-identification system that consists of two parts: an extractor based on a deep learning model that can extract feature vectors from the input video clips, and a real-time clustering algorithm that automatically groups feature vectors by nodule. The system obtains satisfactory results and exhibits the capability to differentiate ultrasound videos. As far as we know, this is the first attempt to apply a re-identification technique in the ultrasonic field.
Submitted 10 October, 2023;
originally announced October 2023.