-
Recurrent Inference Machine for Medical Image Registration
Authors:
Yi Zhang,
Yidong Zhao,
Hui Xue,
Peter Kellman,
Stefan Klein,
Qian Tao
Abstract:
Image registration is essential for medical image applications where alignment of voxels across multiple images is needed for qualitative or quantitative analysis. With recent advancements in deep neural networks and parallel computing, deep learning-based medical image registration methods become competitive with their flexible modelling and fast inference capabilities. However, compared to tradi…
▽ More
Image registration is essential for medical image applications where alignment of voxels across multiple images is needed for qualitative or quantitative analysis. With recent advancements in deep neural networks and parallel computing, deep learning-based medical image registration methods become competitive with their flexible modelling and fast inference capabilities. However, compared to traditional optimization-based registration methods, the speed advantage may come at the cost of registration performance at inference time. Besides, deep neural networks ideally demand large training datasets while optimization-based methods are training-free. To improve registration accuracy and data efficiency, we propose a novel image registration method, termed Recurrent Inference Image Registration (RIIR) network. RIIR is formulated as a meta-learning solver to the registration problem in an iterative manner. RIIR addresses the accuracy and data efficiency issues, by learning the update rule of optimization, with implicit regularization combined with explicit gradient input.
We evaluated RIIR extensively on brain MRI and quantitative cardiac MRI datasets, in terms of both registration accuracy and training data efficiency. Our experiments showed that RIIR outperformed a range of deep learning-based methods, even with only $5\%$ of the training data, demonstrating high data efficiency. Key findings from our ablation studies highlighted the important added value of the hidden states introduced in the recurrent inference framework for meta-learning. Our proposed RIIR offers a highly data-efficient framework for deep learning-based medical image registration.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
Communication-Efficient MARL for Platoon Stability and Energy-efficiency Co-optimization in Cooperative Adaptive Cruise Control of CAVs
Authors:
Min Hua,
Dong Chen,
Kun Jiang,
Fanggang Zhang,
**hai Wang,
Bo Wang,
Quan Zhou,
Hongming Xu
Abstract:
Cooperative adaptive cruise control (CACC) has been recognized as a fundamental function of autonomous driving, in which platoon stability and energy efficiency are outstanding challenges that are difficult to accommodate in real-world operations. This paper studied the CACC of connected and autonomous vehicles (CAVs) based on the multi-agent reinforcement learning algorithm (MARL) to optimize pla…
▽ More
Cooperative adaptive cruise control (CACC) has been recognized as a fundamental function of autonomous driving, in which platoon stability and energy efficiency are outstanding challenges that are difficult to accommodate in real-world operations. This paper studied the CACC of connected and autonomous vehicles (CAVs) based on the multi-agent reinforcement learning algorithm (MARL) to optimize platoon stability and energy efficiency simultaneously. The optimal use of communication bandwidth is the key to guaranteeing learning performance in real-world driving, and thus this paper proposes a communication-efficient MARL by incorporating the quantified stochastic gradient descent (QSGD) and a binary differential consensus (BDC) method into a fully-decentralized MARL framework. We benchmarked the performance of our proposed BDC-MARL algorithm against several several non-communicative andcommunicative MARL algorithms, e.g., IA2C, FPrint, and DIAL, through the evaluation of platoon stability, fuel economy, and driving comfort. Our results show that BDC-MARL achieved the highest energy savings, improving by up to 5.8%, with an average velocity of 15.26 m/s and an inter-vehicle spacing of 20.76 m. In addition, we conducted different information-sharing analyses to assess communication efficacy, along with sensitivity analyses and scalability tests with varying platoon sizes. The practical effectiveness of our approach is further demonstrated using real-world scenarios sourced from open-sourced OpenACC.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Approximate Angular Domain Expression for Near-Field XL-MIMO Channel
Authors:
Hongbo Xing,
Yuxiang Zhang,
Jianhua Zhang,
Huixin Xu,
Guangyi Liu,
Qixing Wang
Abstract:
As Extremely Large-Scale Multiple-Input-Multiple-Output (XL-MIMO) technology advances and frequency band rises, the near-field effects in communication are intensifying. A concise and accurate near-field XL-MIMO channel model serves as the cornerstone for investigating the near-field effects. However, existing angular domain XL-MIMO channel models under near-field conditions require non-closed-for…
▽ More
As Extremely Large-Scale Multiple-Input-Multiple-Output (XL-MIMO) technology advances and frequency band rises, the near-field effects in communication are intensifying. A concise and accurate near-field XL-MIMO channel model serves as the cornerstone for investigating the near-field effects. However, existing angular domain XL-MIMO channel models under near-field conditions require non-closed-form wave-number domain integrals for computation, which is complicated. To obtain a more succinct channel model, this paper introduces a closed-form approximate expression based on the principle of stationary phase. It was subsequently shown that when the scatterer distance is larger than the array aperture, the closed-form model can be further simplified as a trapezoidal spectrum. We validate the accuracy of the proposed approximation through simulations of power angular spectrum similarity. The results indicate that the proposed approximation can accurately approximate the near-field angular domain channel within the effective Rayleigh distance.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Balancing Performance and Cost for Two-Hop Cooperative Communications: Stackelberg Game and Distributed Multi-Agent Reinforcement Learning
Authors:
Yuanzhe Geng,
Erwu Liu,
Wei Ni,
Rui Wang,
Yan Liu,
Hao Xu,
Chen Cai,
Abbas Jamalipour
Abstract:
This paper aims to balance performance and cost in a two-hop wireless cooperative communication network where the source and relays have contradictory optimization goals and make decisions in a distributed manner. This differs from most existing works that have typically assumed that source and relay nodes follow a schedule created implicitly by a central controller. We propose that the relays for…
▽ More
This paper aims to balance performance and cost in a two-hop wireless cooperative communication network where the source and relays have contradictory optimization goals and make decisions in a distributed manner. This differs from most existing works that have typically assumed that source and relay nodes follow a schedule created implicitly by a central controller. We propose that the relays form an alliance in an attempt to maximize the benefit of relaying while the source aims to increase the channel capacity cost-effectively. To this end, we establish the trade problem as a Stackelberg game, and prove the existence of its equilibrium. Another important aspect is that we use multi-agent reinforcement learning (MARL) to approach the equilibrium in a situation where the instantaneous channel state information (CSI) is unavailable, and the source and relays do not have knowledge of each other's goal. A multi-agent deep deterministic policy gradient-based framework is designed, where the relay alliance and the source act as agents. Experiments demonstrate that the proposed method can obtain an acceptable performance that is close to the game-theoretic equilibrium for all players under time-invariant environments, which considerably outperforms its potential alternatives and is only about 2.9% away from the optimal solution.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model
Authors:
Zhaoqing Li,
Haoning Xu,
Tianzi Wang,
Shoukang Hu,
Zengrui **,
Shujie Hu,
Jiajun Deng,
Mingyu Cui,
Mengzhe Geng,
Xunying Liu
Abstract:
We propose a novel one-pass multiple ASR systems joint compression and quantization approach using an all-in-one neural model. A single compression cycle allows multiple nested systems with varying Encoder depths, widths, and quantization precision settings to be simultaneously constructed without the need to train and store individual target systems separately. Experiments consistently demonstrat…
▽ More
We propose a novel one-pass multiple ASR systems joint compression and quantization approach using an all-in-one neural model. A single compression cycle allows multiple nested systems with varying Encoder depths, widths, and quantization precision settings to be simultaneously constructed without the need to train and store individual target systems separately. Experiments consistently demonstrate the multiple ASR systems compressed in a single all-in-one model produced a word error rate (WER) comparable to, or lower by up to 1.01\% absolute (6.98\% relative) than individually trained systems of equal complexity. A 3.4x overall system compression and training time speed-up was achieved. Maximum model size compression ratios of 12.8x and 3.93x were obtained over the baseline Switchboard-300hr Conformer and LibriSpeech-100hr fine-tuned wav2vec2.0 models, respectively, incurring no statistically significant WER increase.
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
RSEND: Retinex-based Squeeze and Excitation Network with Dark Region Detection for Efficient Low Light Image Enhancement
Authors:
**gcheng Li,
Ye Qiao,
Haocheng Xu,
Sitao Huang
Abstract:
Images captured under low-light scenarios often suffer from low quality. Previous CNN-based deep learning methods often involve using Retinex theory. Nevertheless, most of them cannot perform well in more complicated datasets like LOL-v2 while consuming too much computational resources. Besides, some of these methods require sophisticated training at different stages, making the procedure even mor…
▽ More
Images captured under low-light scenarios often suffer from low quality. Previous CNN-based deep learning methods often involve using Retinex theory. Nevertheless, most of them cannot perform well in more complicated datasets like LOL-v2 while consuming too much computational resources. Besides, some of these methods require sophisticated training at different stages, making the procedure even more time-consuming and tedious. In this paper, we propose a more accurate, concise, and one-stage Retinex theory based framework, RSEND. RSEND first divides the low-light image into the illumination map and reflectance map, then captures the important details in the illumination map and performs light enhancement. After this step, it refines the enhanced gray-scale image and does element-wise matrix multiplication with the reflectance map. By denoising the output it has from the previous step, it obtains the final result. In all the steps, RSEND utilizes Squeeze and Excitation network to better capture the details. Comprehensive quantitative and qualitative experiments show that our Efficient Retinex model significantly outperforms other CNN-based models, achieving a PSNR improvement ranging from 0.44 dB to 4.2 dB in different datasets and even outperforms transformer-based models in the LOL-v2-real dataset.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection
Authors:
Rong Gong,
Hongfei Xue,
Lezhi Wang,
Xin Xu,
Qisheng Li,
Lei Xie,
Hui Bu,
Shaomei Wu,
Jiaming Zhou,
Yong Qin,
Binbin Zhang,
Jun Du,
Jia Bin,
Ming Li
Abstract:
The rapid advancements in speech technologies over the past two decades have led to human-level performance in tasks like automatic speech recognition (ASR) for fluent speech. However, the efficacy of these models diminishes when applied to atypical speech, such as stuttering. This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset, which stands out as the large…
▽ More
The rapid advancements in speech technologies over the past two decades have led to human-level performance in tasks like automatic speech recognition (ASR) for fluent speech. However, the efficacy of these models diminishes when applied to atypical speech, such as stuttering. This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset, which stands out as the largest dataset in its category. Encompassing conversational and voice command reading speech, AS-70 includes verbatim manual transcription, rendering it suitable for various speech-related tasks. Furthermore, baseline systems are established, and experimental results are presented for ASR and stuttering event detection (SED) tasks. By incorporating this dataset into the model fine-tuning, significant improvements in the state-of-the-art ASR models, e.g., Whisper and Hubert, are observed, enhancing their inclusivity in addressing stuttered speech.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
Transforming Heart Chamber Imaging: Self-Supervised Learning for Whole Heart Reconstruction and Segmentation
Authors:
Abdul Qayyum,
Hao Xu,
Brian P. Halliday,
Cristobal Rodero,
Christopher W. Lanyon,
Richard D. Wilkinson,
Steven Alexander Niederer
Abstract:
Automated segmentation of Cardiac Magnetic Resonance (CMR) plays a pivotal role in efficiently assessing cardiac function, offering rapid clinical evaluations that benefit both healthcare practitioners and patients. While recent research has primarily focused on delineating structures in the short-axis orientation, less attention has been given to long-axis representations, mainly due to the compl…
▽ More
Automated segmentation of Cardiac Magnetic Resonance (CMR) plays a pivotal role in efficiently assessing cardiac function, offering rapid clinical evaluations that benefit both healthcare practitioners and patients. While recent research has primarily focused on delineating structures in the short-axis orientation, less attention has been given to long-axis representations, mainly due to the complex nature of structures in this orientation. Performing pixel-wise segmentation of the left ventricular (LV) myocardium and the four cardiac chambers in 2-D steady-state free precession (SSFP) cine sequences is a crucial preprocessing stage for various analyses. However, the challenge lies in the significant variability in contrast, appearance, orientation, and positioning of the heart across different patients, clinical views, scanners, and imaging protocols. Consequently, achieving fully automatic semantic segmentation in this context is notoriously challenging. In recent years, several deep learning models have been proposed to accurately quantify and diagnose cardiac pathologies. These automated tools heavily rely on the accurate segmentation of cardiac structures in magnetic resonance images (MRI). Hence, there is a need for new methods to handle such structures' geometrical and textural complexities. We proposed 2D and 3D two-stage self-supervised deep learning segmentation hybrid transformer and CNN-based architectures for 4CH whole heart segmentation. Accurate segmentation of the ventricles and atria in 4CH views is crucial for analyzing heart health and reconstructing four-chamber meshes, which are essential for estimating various parameters to assess overall heart condition. Our proposed method outperformed state-of-the-art techniques, demonstrating superior performance in this domain.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
Label-Loo**: Highly Efficient Decoding for Transducers
Authors:
Vladimir Bataev,
Hainan Xu,
Daniel Galvez,
Vitaly Lavrukhin,
Boris Ginsburg
Abstract:
This paper introduces a highly efficient greedy decoding algorithm for Transducer inference. We propose a novel data structure using CUDA tensors to represent partial hypotheses in a batch that supports parallelized hypothesis manipulations. During decoding, our algorithm maximizes GPU parallelism by adopting a nested-loop design, where the inner loop consumes all blank predictions, while non-blan…
▽ More
This paper introduces a highly efficient greedy decoding algorithm for Transducer inference. We propose a novel data structure using CUDA tensors to represent partial hypotheses in a batch that supports parallelized hypothesis manipulations. During decoding, our algorithm maximizes GPU parallelism by adopting a nested-loop design, where the inner loop consumes all blank predictions, while non-blank predictions are handled in the outer loop. Our algorithm is general-purpose and can work with both conventional Transducers and Token-and-Duration Transducers. Experiments show that the label-loo** algorithm can bring a speedup up to 2.0X compared to conventional batched decoding algorithms when using batch size 32, and can be combined with other compiler or GPU call-related techniques to bring more speedup. We will open-source our implementation to benefit the research community.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Sok: Comprehensive Security Overview, Challenges, and Future Directions of Voice-Controlled Systems
Authors:
Haozhe Xu,
Cong Wu,
Yangyang Gu,
Xingcan Shang,
**g Chen,
Kun He,
Ruiying Du
Abstract:
The integration of Voice Control Systems (VCS) into smart devices and their growing presence in daily life accentuate the importance of their security. Current research has uncovered numerous vulnerabilities in VCS, presenting significant risks to user privacy and security. However, a cohesive and systematic examination of these vulnerabilities and the corresponding solutions is still absent. This…
▽ More
The integration of Voice Control Systems (VCS) into smart devices and their growing presence in daily life accentuate the importance of their security. Current research has uncovered numerous vulnerabilities in VCS, presenting significant risks to user privacy and security. However, a cohesive and systematic examination of these vulnerabilities and the corresponding solutions is still absent. This lack of comprehensive analysis presents a challenge for VCS designers in fully understanding and mitigating the security issues within these systems.
Addressing this gap, our study introduces a hierarchical model structure for VCS, providing a novel lens for categorizing and analyzing existing literature in a systematic manner. We classify attacks based on their technical principles and thoroughly evaluate various attributes, such as their methods, targets, vectors, and behaviors. Furthermore, we consolidate and assess the defense mechanisms proposed in current research, offering actionable recommendations for enhancing VCS security. Our work makes a significant contribution by simplifying the complexity inherent in VCS security, aiding designers in effectively identifying and countering potential threats, and setting a foundation for future advancements in VCS security research.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
Transmission Interface Power Flow Adjustment: A Deep Reinforcement Learning Approach based on Multi-task Attribution Map
Authors:
Shunyu Liu,
Wei Luo,
Yanzhen Zhou,
Kaixuan Chen,
Quan Zhang,
Huating Xu,
Qinglai Guo,
Mingli Song
Abstract:
Transmission interface power flow adjustment is a critical measure to ensure the security and economy operation of power systems. However, conventional model-based adjustment schemes are limited by the increasing variations and uncertainties occur in power systems, where the adjustment problems of different transmission interfaces are often treated as several independent tasks, ignoring their coup…
▽ More
Transmission interface power flow adjustment is a critical measure to ensure the security and economy operation of power systems. However, conventional model-based adjustment schemes are limited by the increasing variations and uncertainties occur in power systems, where the adjustment problems of different transmission interfaces are often treated as several independent tasks, ignoring their coupling relationship and even leading to conflict decisions. In this paper, we introduce a novel data-driven deep reinforcement learning (DRL) approach, to handle multiple power flow adjustment tasks jointly instead of learning each task from scratch. At the heart of the proposed method is a multi-task attribution map (MAM), which enables the DRL agent to explicitly attribute each transmission interface task to different power system nodes with task-adaptive attention weights. Based on this MAM, the agent can further provide effective strategies to solve the multi-task adjustment problem with a near-optimal operation cost. Simulation results on the IEEE 118-bus system, a realistic 300-bus system in China, and a very large European system with 9241 buses demonstrate that the proposed method significantly improves the performance compared with several baseline methods, and exhibits high interpretability with the learnable MAM.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
Channel Estimation and Reconstruction in Fluid Antenna System: Oversampling is Essential
Authors:
Wee Kiat New,
Kai-Kit Wong,
Hao Xu,
Farshad Rostami Ghadi,
Ross Murch,
Chan-Byoung Chae
Abstract:
Fluid antenna system (FAS) has recently surfaced as a promising technology for the upcoming sixth generation (6G) wireless networks. Unlike traditional antenna system (TAS) with fixed antenna location, FAS introduces a flexible component where the radiating element can switch its position within a predefined space. This capability allows FAS to achieve additional diversity and multiplexing gains.…
▽ More
Fluid antenna system (FAS) has recently surfaced as a promising technology for the upcoming sixth generation (6G) wireless networks. Unlike traditional antenna system (TAS) with fixed antenna location, FAS introduces a flexible component where the radiating element can switch its position within a predefined space. This capability allows FAS to achieve additional diversity and multiplexing gains. Nevertheless, to fully reap the benefits of FAS, obtaining channel state information (CSI) over the predefined space is crucial. In this paper, we explore the interaction between a transmitter equipped with a traditional antenna and a receiver with a fluid antenna over an electromagnetic-compliant channel model. We address the challenges of channel estimation and reconstruction using Nyquist sampling and maximum likelihood estimation (MLE) methods. Our analysis reveals a fundamental tradeoff between the accuracy of the reconstructed channel and the number of estimated channels, indicating that half-wavelength sampling is insufficient for perfect reconstruction and that oversampling is essential to enhance accuracy. Despite its advantages, oversampling can introduce practical challenges. Consequently, we propose a suboptimal sampling distance that facilitates efficient channel reconstruction. In addition, we employ the MLE method to bound the channel estimation error by $ε$, with a specific confidence interval (CI). Our findings enable us to determine the minimum number of estimated channels and the total number of pilot symbols required for efficient channel reconstruction in a given space. Lastly, we investigate the rate performance of FAS and TAS and demonstrate that FAS with imperfect CSI can outperform TAS with perfect CSI.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
Shifting the ISAC Trade-Off with Fluid Antenna Systems
Authors:
Jiaqi Zou,
Hao Xu,
Chao Wang,
Lvxin Xu,
Songlin Sun,
Kaitao Meng,
Christos Masouros,
Kai-Kit Wong
Abstract:
As an emerging antenna technology, a fluid antenna system (FAS) enhances spatial diversity to improve both sensing and communication performance by shifting the active antennas among available ports. In this letter, we study the potential of shifting the integrated sensing and communication (ISAC) trade-off with FAS. We propose the model for FAS-enabled ISAC and jointly optimize the transmit beamf…
▽ More
As an emerging antenna technology, a fluid antenna system (FAS) enhances spatial diversity to improve both sensing and communication performance by shifting the active antennas among available ports. In this letter, we study the potential of shifting the integrated sensing and communication (ISAC) trade-off with FAS. We propose the model for FAS-enabled ISAC and jointly optimize the transmit beamforming and port selection of FAS. In particular, we aim to minimize the transmit power, while satisfying both communication and sensing requirements. An efficient iterative algorithm based on sparse optimization, convex approximation, and a penalty approach is developed. The simulation results show that the proposed scheme can attain 33% reductions in transmit power with guaranteed sensing and communication performance, showing the great potential of the fluid antenna for striking a flexible tradeoff between sensing and communication in ISAC systems.
△ Less
Submitted 9 May, 2024;
originally announced May 2024.
-
Technical report on target classification in SAR track
Authors:
Haonan Xu,
Han Yinan,
Haotian Si,
Yang Yang
Abstract:
This report proposes a robust method for classifying oceanic and atmospheric phenomena using synthetic aperture radar (SAR) imagery. Our proposed method leverages the powerful pre-trained model Swin Transformer v2 Large as the backbone and employs carefully designed data augmentation and exponential moving average during training to enhance the model's generalization capability and stability. In t…
▽ More
This report proposes a robust method for classifying oceanic and atmospheric phenomena using synthetic aperture radar (SAR) imagery. Our proposed method leverages the powerful pre-trained model Swin Transformer v2 Large as the backbone and employs carefully designed data augmentation and exponential moving average during training to enhance the model's generalization capability and stability. In the testing stage, a method called ReAct is utilized to rectify activation values and utilize Energy Score for more accurate measurement of model uncertainty, significantly improving out-of-distribution detection performance. Furthermore, test time augmentation is employed to enhance classification accuracy and prediction stability. Comprehensive experimental results demonstrate that each additional technique significantly improves classification accuracy, confirming their effectiveness in classifying maritime and atmospheric phenomena in SAR imagery.
△ Less
Submitted 3 May, 2024;
originally announced May 2024.
-
Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets
Authors:
Xuelong Geng,
Tianyi Xu,
Kun Wei,
Bingshen Mu,
Hongfei Xue,
He Wang,
Yangze Li,
Pengcheng Guo,
Yuhang Dai,
Longhao Li,
Mingchen Shao,
Lei Xie
Abstract:
Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configu…
▽ More
Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve the SOTA performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs to promote reproducible research.
△ Less
Submitted 6 May, 2024; v1 submitted 3 May, 2024;
originally announced May 2024.
-
Advancing low-field MRI with a universal denoising imaging transformer: Towards fast and high-quality imaging
Authors:
Zheren Zhu,
Azaan Rehman,
Xiaozhi Cao,
Congyu Liao,
Yoo ** Lee,
Michael Ohliger,
Hui Xue,
Yang Yang
Abstract:
Recent developments in low-field (LF) magnetic resonance imaging (MRI) systems present remarkable opportunities for affordable and widespread MRI access. A robust denoising method to overcome the intrinsic low signal-noise-ratio (SNR) barrier is critical to the success of LF MRI. However, current data-driven MRI denoising methods predominantly handle magnitude images and rely on customized models…
▽ More
Recent developments in low-field (LF) magnetic resonance imaging (MRI) systems present remarkable opportunities for affordable and widespread MRI access. A robust denoising method to overcome the intrinsic low signal-noise-ratio (SNR) barrier is critical to the success of LF MRI. However, current data-driven MRI denoising methods predominantly handle magnitude images and rely on customized models with constrained data diversity and quantity, which exhibit limited generalizability in clinical applications across diverse MRI systems, pulse sequences, and organs. In this study, we present ImT-MRD: a complex-valued imaging transformer trained on a vast number of clinical MRI scans aiming at universal MR denoising at LF systems. Compared with averaging multiple-repeated scans for higher image SNR, the model obtains better image quality from fewer repetitions, demonstrating its capability for accelerating scans under various clinical settings. Moreover, with its complex-valued image input, the model can denoise intermediate results before advanced post-processing and prepare high-quality data for further MRI research. By delivering universal and accurate denoising across clinical and research tasks, our model holds great promise to expedite the evolution of LF MRI for accessible and equal biomedical applications.
△ Less
Submitted 29 April, 2024;
originally announced April 2024.
-
NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results
Authors:
Xin Li,
Kun Yuan,
Ya**g Pei,
Yiting Lu,
Ming Sun,
Chao Zhou,
Zhibo Chen,
Radu Timofte,
Wei Sun,
Haoning Wu,
Zicheng Zhang,
Jun Jia,
Zhichao Zhang,
Linhan Cao,
Qiubo Chen,
Xiongkuo Min,
Weisi Lin,
Guangtao Zhai,
Jianhui Sun,
Tianyi Wang,
Lei Li,
Han Kong,
Wenxuan Wang,
Bing Li,
Cheng Luo
, et al. (43 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The…
▽ More
This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performances for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report
Authors:
Bin Ren,
Yawei Li,
Nancy Mehta,
Radu Timofte,
Hongyuan Yu,
Cheng Wan,
Yuxin Hong,
Bingnan Han,
Zhuoyuan Wu,
Yajun Zou,
Yuqing Liu,
Jizhe Li,
Keji He,
Chao Fan,
Heng Zhang,
Xiaolin Zhang,
Xuanwu Yin,
Kunlong Zuo,
Bohao Liao,
Peizhe Xia,
Long Peng,
Zhibo Du,
Xin Di,
Wangkai Li,
Yang Wang
, et al. (109 additional authors not shown)
Abstract:
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such…
▽ More
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (ie runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum-up of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.
△ Less
Submitted 25 June, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition
Authors:
Hainan Xu,
Zhehuai Chen,
Fei Jia,
Boris Ginsburg
Abstract:
This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model's decoder embedding incorporates shared components for text tokens with the same or similar pronunciations. With experiments conducted in multiple datasets in Mandarin Chinese and Korean, we show that P…
▽ More
This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model's decoder embedding incorporates shared components for text tokens with the same or similar pronunciations. With experiments conducted in multiple datasets in Mandarin Chinese and Korean, we show that PET models consistently improve speech recognition accuracy compared to conventional Transducers. Our investigation also uncovers a phenomenon that we call error chain reactions. Instead of recognition errors being evenly spread throughout an utterance, they tend to group together, with subsequent errors often following earlier ones. Our analysis shows that PET models effectively mitigate this issue by substantially reducing the likelihood of the model generating additional errors following a prior one. Our implementation will be open-sourced with the NeMo toolkit.
△ Less
Submitted 4 April, 2024;
originally announced April 2024.
-
Inline AI: Open-source Deep Learning Inference for Cardiac MR
Authors:
Hui Xue,
Rhodri H Davies,
James Howard,
Hunain Shiwani,
Azaan Rehman,
Iain Pierce,
Henry Procter,
Marianna Fontana,
James C Moon,
Eylem Levelt,
Peter Kellman
Abstract:
Cardiac Magnetic Resonance (CMR) is established as a non-invasive imaging technique for evaluation of heart function, anatomy, and myocardial tissue characterization. Quantitative biomarkers are central for diagnosis and management of heart disease. Deep learning (DL) is playing an ever more important role in extracting these quantitative measures from CMR images. While many researchers have repor…
▽ More
Cardiac Magnetic Resonance (CMR) is established as a non-invasive imaging technique for evaluation of heart function, anatomy, and myocardial tissue characterization. Quantitative biomarkers are central for diagnosis and management of heart disease. Deep learning (DL) is playing an ever more important role in extracting these quantitative measures from CMR images. While many researchers have reported promising results in training and evaluating models, model deployment into the imaging workflow is less explored.
A new imaging AI framework, the InlineAI, was developed and open-sourced. The main innovation is to enable the model inference inline as a part of imaging computation, instead of as an offline post-processing step and to allow users to plug in their models. We demonstrate the system capability on three applications: long-axis CMR cine landmark detection, short-axis CMR cine analysis of function and anatomy, and quantitative perfusion map**.
The InlineAI allowed models to be deployed into imaging workflow in a streaming manner directly on the scanner. The model was loaded and inference on incoming images were performed while the data acquisition was ongoing, and results were sent back to scanner. Several biomarkers were extracted from model outputs in the demonstrated applications and reported as curves and tabular values. All processes are full automated. the model inference was completed within 6-45s after the end of imaging data acquisition.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
Imaging transformer for MRI denoising with the SNR unit training: enabling generalization across field-strengths, imaging contrasts, and anatomy
Authors:
Hui Xue,
Sarah Hooper,
Azaan Rehman,
Iain Pierce,
Thomas Treibel,
Rhodri Davies,
W Patricia Bandettini,
Rajiv Ramasawmy,
Ahsan Javed,
Zheren Zhu,
Yang Yang,
James Moon,
Adrienne Campbell,
Peter Kellman
Abstract:
The ability to recover MRI signal from noise is key to achieve fast acquisition, accurate quantification, and high image quality. Past work has shown convolutional neural networks can be used with abundant and paired low and high-SNR images for training. However, for applications where high-SNR data is difficult to produce at scale (e.g. with aggressive acceleration, high resolution, or low field…
▽ More
The ability to recover MRI signal from noise is key to achieve fast acquisition, accurate quantification, and high image quality. Past work has shown convolutional neural networks can be used with abundant and paired low and high-SNR images for training. However, for applications where high-SNR data is difficult to produce at scale (e.g. with aggressive acceleration, high resolution, or low field strength), training a new denoising network using a large quantity of high-SNR images can be infeasible.
In this study, we overcome this limitation by improving the generalization of denoising models, enabling application to many settings beyond what appears in the training data. Specifically, we a) develop a training scheme that uses complex MRIs reconstructed in the SNR units (i.e., the images have a fixed noise level, SNR unit training) and augments images with realistic noise based on coil g-factor, and b) develop a novel imaging transformer (imformer) to handle 2D, 2D+T, and 3D MRIs in one model architecture. Through empirical evaluation, we show this combination improves performance compared to CNN models and improves generalization, enabling a denoising model to be used across field-strengths, image contrasts, and anatomy.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
A multi-stage semi-supervised learning for ankle fracture classification on CT images
Authors:
Hongzhi Liu,
Guicheng Li,
Jiacheng Nie,
Hui Tang,
Chunfeng Yang,
Qian** Feng,
Hailin Xu,
Yang Chen
Abstract:
Because of the complicated mechanism of ankle injury, it is very difficult to diagnose ankle fracture in clinic. In order to simplify the process of fracture diagnosis, an automatic diagnosis model of ankle fracture was proposed. Firstly, a tibia-fibula segmentation network is proposed for the joint tibiofibular region of the ankle joint, and the corresponding segmentation dataset is established o…
▽ More
Because of the complicated mechanism of ankle injury, it is very difficult to diagnose ankle fracture in clinic. In order to simplify the process of fracture diagnosis, an automatic diagnosis model of ankle fracture was proposed. Firstly, a tibia-fibula segmentation network is proposed for the joint tibiofibular region of the ankle joint, and the corresponding segmentation dataset is established on the basis of fracture data. Secondly, the image registration method is used to register the bone segmentation mask with the normal bone mask. Finally, a semi-supervised classifier is constructed to make full use of a large number of unlabeled data to classify ankle fractures. Experiments show that the proposed method can segment fractures with fracture lines accurately and has better performance than the general method. At the same time, this method is superior to classification network in several indexes.
△ Less
Submitted 29 March, 2024;
originally announced March 2024.
-
Adaptive Frequency Bin Interval in FFT via Dense Sampling Factor $α$
Authors:
Haichao Xu
Abstract:
The Fast Fourier Transform (FFT) is a fundamental tool for signal analysis, widely used across various fields. However, traditional FFT methods encounter challenges in adjusting the frequency bin interval, which may impede accurate spectral analysis. In this study, we propose a method for adjusting the frequency bin interval in FFT by introducing a parameter $α$. We elucidate the underlying princi…
▽ More
The Fast Fourier Transform (FFT) is a fundamental tool for signal analysis, widely used across various fields. However, traditional FFT methods encounter challenges in adjusting the frequency bin interval, which may impede accurate spectral analysis. In this study, we propose a method for adjusting the frequency bin interval in FFT by introducing a parameter $α$. We elucidate the underlying principles of the proposed method and discuss its potential applications across various contexts. Our findings suggest that the proposed method offers a promising approach to overcome the limitations of traditional FFT methods and enhance spectral analysis accuracy.
△ Less
Submitted 26 March, 2024; v1 submitted 25 March, 2024;
originally announced March 2024.
-
Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution
Authors:
Qing** Zheng,
Ling Zheng,
Yuanfan Guo,
Ying Li,
Songcen Xu,
Jiankang Deng,
Hang Xu
Abstract:
Artifact-free super-resolution (SR) aims to translate low-resolution images into their high-resolution counterparts with a strict integrity of the original content, eliminating any distortions or synthetic details. While traditional diffusion-based SR techniques have demonstrated remarkable abilities to enhance image detail, they are prone to artifact introduction during iterative procedures. Such…
▽ More
Artifact-free super-resolution (SR) aims to translate low-resolution images into their high-resolution counterparts with a strict integrity of the original content, eliminating any distortions or synthetic details. While traditional diffusion-based SR techniques have demonstrated remarkable abilities to enhance image detail, they are prone to artifact introduction during iterative procedures. Such artifacts, ranging from trivial noise to unauthentic textures, deviate from the true structure of the source image, thus challenging the integrity of the super-resolution process. In this work, we propose Self-Adaptive Reality-Guided Diffusion (SARGD), a training-free method that delves into the latent space to effectively identify and mitigate the propagation of artifacts. Our SARGD begins by using an artifact detector to identify implausible pixels, creating a binary mask that highlights artifacts. Following this, the Reality Guidance Refinement (RGR) process refines artifacts by integrating this mask with realistic latent representations, improving alignment with the original image. Nonetheless, initial realistic-latent representations from lower-quality images result in over-smoothing in the final output. To address this, we introduce a Self-Adaptive Guidance (SAG) mechanism. It dynamically computes a reality score, enhancing the sharpness of the realistic latent. These alternating mechanisms collectively achieve artifact-free super-resolution. Extensive experiments demonstrate the superiority of our method, delivering detailed artifact-free high-resolution images while reducing sampling steps by 2X. We release our code at https://github.com/ProAirVerse/Self-Adaptive-Guidance-Diffusion.git.
△ Less
Submitted 25 March, 2024;
originally announced March 2024.
-
RadioGAT: A Joint Model-based and Data-driven Framework for Multi-band Radiomap Reconstruction via Graph Attention Networks
Authors:
Xiaojie Li,
Songyang Zhang,
Hang Li,
Xiaoyang Li,
Lexi Xu,
Haigao Xu,
Hui Mei,
Guangxu Zhu,
Nan Qi,
Ming Xiao
Abstract:
Multi-band radiomap reconstruction (MB-RMR) is a key component in wireless communications for tasks such as spectrum management and network planning. However, traditional machine-learning-based MB-RMR methods, which rely heavily on simulated data or complete structured ground truth, face significant deployment challenges. These challenges stem from the differences between simulated and actual data…
▽ More
Multi-band radiomap reconstruction (MB-RMR) is a key component in wireless communications for tasks such as spectrum management and network planning. However, traditional machine-learning-based MB-RMR methods, which rely heavily on simulated data or complete structured ground truth, face significant deployment challenges. These challenges stem from the differences between simulated and actual data, as well as the scarcity of real-world measurements. To address these challenges, our study presents RadioGAT, a novel framework based on Graph Attention Network (GAT) tailored for MB-RMR within a single area, eliminating the need for multi-region datasets. RadioGAT innovatively merges model-based spatial-spectral correlation encoding with data-driven radiomap generalization, thus minimizing the reliance on extensive data sources. The framework begins by transforming sparse multi-band data into a graph structure through an innovative encoding strategy that leverages radio propagation models to capture the spatial-spectral correlation inherent in the data. This graph-based representation not only simplifies data handling but also enables tailored label sampling during training, significantly enhancing the framework's adaptability for deployment. Subsequently, The GAT is employed to generalize the radiomap information across various frequency bands. Extensive experiments using raytracing datasets based on real-world environments have demonstrated RadioGAT's enhanced accuracy in supervised learning settings and its robustness in semi-supervised scenarios. These results underscore RadioGAT's effectiveness and practicality for MB-RMR in environments with limited data availability.
△ Less
Submitted 24 March, 2024;
originally announced March 2024.
-
TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer
Authors:
Yu Xi,
Hao Li,
Baochen Yang,
Haoyu Li,
Hainan Xu,
Kai Yu
Abstract:
Designing an efficient keyword spotting (KWS) system that delivers exceptional performance on resource-constrained edge devices has long been a subject of significant attention. Existing KWS search algorithms typically follow a frame-synchronous approach, where search decisions are made repeatedly at each frame despite the fact that most frames are keyword-irrelevant. In this paper, we propose TDT…
▽ More
Designing an efficient keyword spotting (KWS) system that delivers exceptional performance on resource-constrained edge devices has long been a subject of significant attention. Existing KWS search algorithms typically follow a frame-synchronous approach, where search decisions are made repeatedly at each frame despite the fact that most frames are keyword-irrelevant. In this paper, we propose TDT-KWS, which leverages token-and-duration Transducers (TDT) for KWS tasks. We also propose a novel KWS task-specific decoding algorithm for Transducer-based models, which supports highly effective frame-asynchronous keyword search in streaming speech scenarios. With evaluations conducted on both the public Hey Snips and self-constructed LibriKWS-20 datasets, our proposed KWS-decoding algorithm produces more accurate results than conventional ASR decoding algorithms. Additionally, TDT-KWS achieves on-par or better wake word detection performance than both RNN-T and traditional TDT-ASR systems while achieving significant inference speed-up. Furthermore, experiments show that TDT-KWS is more robust to noisy environments compared to RNN-T KWS.
△ Less
Submitted 20 March, 2024;
originally announced March 2024.
-
On Performance of RIS-Aided Fluid Antenna Systems
Authors:
Farshad Rostami Ghadi,
Kai-Kit Wong,
Wee Kiat New,
Hao Xu,
Ross Murch,
Yangyang Zhang
Abstract:
This letter studies the performance of reconfigurable intelligent surface (RIS)-aided communications for a fluid antenna system (FAS) enabled receiver. Specifically, a fixed singleantenna base station (BS) transmits information through a RIS to a mobile user (MU) which is equipped with a planar fluid antenna in the absence of a direct link.We first analyze the spatial correlation structures among…
▽ More
This letter studies the performance of reconfigurable intelligent surface (RIS)-aided communications for a fluid antenna system (FAS) enabled receiver. Specifically, a fixed singleantenna base station (BS) transmits information through a RIS to a mobile user (MU) which is equipped with a planar fluid antenna in the absence of a direct link.We first analyze the spatial correlation structures among the positions (or ports) in the planar FAS, and then derive the joint distribution of the equivalent channel gain at the user by exploiting the central limit theorem. Furthermore, we obtain compact analytical expressions for the outage probability (OP) and delay outage rate (DOR). Numerical results illustrate that using FAS with only one activated port into the RIS-aided communication network can greatly enhance the performance, when compared to traditional antenna systems (TAS).
△ Less
Submitted 25 February, 2024;
originally announced February 2024.
-
High-Performance Distributed Control for Large-Scale Linear Systems: A Partitioned Distributed Observer Approach
Authors:
Haotian Xu,
Shuai Liu,
Ling Shi
Abstract:
In recent years, the distributed-observer-based distributed control law has shown powerful ability to arbitrarily approximate the centralized control performance. However, the traditional distributed observer requires each local observer to reconstruct the state information of the whole system, which is unrealistic for large-scale scenarios. To fill this gap, this paper develops a greedy-idea-base…
▽ More
In recent years, the distributed-observer-based distributed control law has shown powerful ability to arbitrarily approximate the centralized control performance. However, the traditional distributed observer requires each local observer to reconstruct the state information of the whole system, which is unrealistic for large-scale scenarios. To fill this gap, this paper develops a greedy-idea-based large-scale system partition algorithm, which can significantly reduce the dimension of local observers. Then, the partitioned distributed observer for large-scale systems is proposed to overcome the problem that the system dynamics are difficult to estimate due to the coupling between partitions. Furthermore, the two-layer Lyapunov analysis method is adopted and the dynamic transformation lemma of compact errors is proven, which solves the problem of analyzing stability of the error dynamic of the partitioned distributed observer. Finally, it is proved that the distributed control law based on the partitioned distributed observer can also arbitrarily approximate the control performance of the centralized control law, and the dimension of the local observer is greatly reduced compared with the traditional method. The simulation results show that when the similarity between the physical network and the communication network is about 80%, the local observer dimension is greatly reduced by 90% and the relative error between the performance of the distributed control law and that of the centralized control law is less than 1%.
△ Less
Submitted 10 February, 2024;
originally announced February 2024.
-
Physical Layer Security over Fluid Antenna Systems
Authors:
Farshad Rostami Ghadi,
Kai-Kit Wong,
F. Javier Lopez-Martinez,
Wee Kiat New,
Hao Xu,
Chan-Byoung Chae
Abstract:
This paper investigates the performance of physical layer security (PLS) in fluid antenna-aided communication systems under arbitrary correlated fading channels. In particular, it is considered that a single fixed-antenna transmitter aims to send confidential information to a legitimate receiver equipped with a planar fluid antenna system (FAS), while an eavesdropper, also taking advantage of a pl…
▽ More
This paper investigates the performance of physical layer security (PLS) in fluid antenna-aided communication systems under arbitrary correlated fading channels. In particular, it is considered that a single fixed-antenna transmitter aims to send confidential information to a legitimate receiver equipped with a planar fluid antenna system (FAS), while an eavesdropper, also taking advantage of a planar FAS, attempts to decode the desired message. For this scenario, we first present analytical expressions of the equivalent channel distributions at the legitimate user and eavesdropper by using copula, so that the obtained analytical results are valid for any arbitrarily correlated fading distributions. Then, with the help of Gauss-Laguerre quadrature, we derive compact analytical expressions for the average secrecy capacity (ASC), the secrecy outage probability (SOP), and the secrecy energy efficiency (SEE) for the FAS wiretap channel. Moreover, for exemplary purposes, we also obtain the compact expression of ASC, SOP, and SEE by utilizing the Gaussian copula under correlated Rayleigh fading channels as a special case. Eventually, numerical results indicate that applying the fluid antenna with only one active port to PLS can guarantee more secure and reliable transmission, when compared to traditional antenna systems (TAS) exploiting maximal ratio combining (MRC).
△ Less
Submitted 8 February, 2024;
originally announced February 2024.
-
Streaming Sequence Transduction through Dynamic Compression
Authors:
Weiting Tan,
Yunmo Chen,
Tongfei Chen,
Guanghui Qin,
Haoran Xu,
Heidi C. Zhang,
Benjamin Van Durme,
Philipp Koehn
Abstract:
We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR) and outperforming existing methods. Moreover, STAR demonstrat…
▽ More
We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR) and outperforming existing methods. Moreover, STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.
△ Less
Submitted 2 February, 2024;
originally announced February 2024.
-
Evaluation of Mean Shift, ComBat, and CycleGAN for Harmonizing Brain Connectivity Matrices Across Sites
Authors:
Hanliang Xu,
Nancy R. Newlin,
Michael E. Kim,
Chenyu Gao,
Praitayini Kanakaraj,
Aravind R. Krishnan,
Lucas W. Remedios,
Nazirah Mohd Khairi,
Kimberly Pechman,
Derek Archer,
Timothy J. Hohman,
Angela L. Jefferson,
The BIOCARD Study Team,
Ivana Isgum,
Yuankai Huo,
Daniel Moyer,
Kurt G. Schilling,
Bennett A. Landman
Abstract:
Connectivity matrices derived from diffusion MRI (dMRI) provide an interpretable and generalizable way of understanding the human brain connectome. However, dMRI suffers from inter-site and between-scanner variation, which impedes analysis across datasets to improve robustness and reproducibility of results. To evaluate different harmonization approaches on connectivity matrices, we compared graph…
▽ More
Connectivity matrices derived from diffusion MRI (dMRI) provide an interpretable and generalizable way of understanding the human brain connectome. However, dMRI suffers from inter-site and between-scanner variation, which impedes analysis across datasets to improve robustness and reproducibility of results. To evaluate different harmonization approaches on connectivity matrices, we compared graph measures derived from these matrices before and after applying three harmonization techniques: mean shift, ComBat, and CycleGAN. The sample comprises 168 age-matched, sex-matched normal subjects from two studies: the Vanderbilt Memory and Aging Project (VMAP) and the Biomarkers of Cognitive Decline Among Normal Individuals (BIOCARD). First, we plotted the graph measures and used coefficient of variation (CoV) and the Mann-Whitney U test to evaluate different methods' effectiveness in removing site effects on the matrices and the derived graph measures. ComBat effectively eliminated site effects for global efficiency and modularity and outperformed the other two methods. However, all methods exhibited poor performance when harmonizing average betweenness centrality. Second, we tested whether our harmonization methods preserved correlations between age and graph measures. All methods except for CycleGAN in one direction improved correlations between age and global efficiency and between age and modularity from insignificant to significant with p-values less than 0.05.
△ Less
Submitted 24 January, 2024; v1 submitted 8 January, 2024;
originally announced January 2024.
-
Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness
Authors:
Sicheng Yang,
Zunnan Xu,
Haiwei Xue,
Yongkang Cheng,
Shaoli Huang,
Mingming Gong,
Zhiyong Wu
Abstract:
Current talking avatars mostly generate co-speech gestures based on audio and text of the utterance, without considering the non-speaking motion of the speaker. Furthermore, previous works on co-speech gesture generation have designed network structures based on individual gesture datasets, which results in limited data volume, compromised generalizability, and restricted speaker movements. To tac…
▽ More
Current talking avatars mostly generate co-speech gestures based on audio and text of the utterance, without considering the non-speaking motion of the speaker. Furthermore, previous works on co-speech gesture generation have designed network structures based on individual gesture datasets, which results in limited data volume, compromised generalizability, and restricted speaker movements. To tackle these issues, we introduce FreeTalker, which, to the best of our knowledge, is the first framework for the generation of both spontaneous (e.g., co-speech gesture) and non-spontaneous (e.g., moving around the podium) speaker motions. Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions, utilizing heterogeneous data sourced from various motion datasets. During inference, we utilize classifier-free guidance to highly control the style in the clips. Additionally, to create smooth transitions between clips, we utilize DoubleTake, a method that leverages a generative prior and ensures seamless motion blending. Extensive experiments show that our method generates natural and controllable speaker movements. Our code, model, and demo are are available at \url{https://youngseng.github.io/FreeTalker/}.
△ Less
Submitted 7 January, 2024;
originally announced January 2024.
-
Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation
Authors:
Huimeng Wang,
Zengrui **,
Mengzhe Geng,
Shujie Hu,
Guinan Li,
Tianzi Wang,
Haoning Xu,
Xunying Liu
Abstract:
Automatic recognition of dysarthric speech remains a highly challenging task to date. Neuro-motor conditions and co-occurring physical disabilities create difficulty in large-scale data collection for ASR system development. Adapting SSL pre-trained ASR models to limited dysarthric speech via data-intensive parameter fine-tuning leads to poor generalization. To this end, this paper presents an ext…
▽ More
Automatic recognition of dysarthric speech remains a highly challenging task to date. Neuro-motor conditions and co-occurring physical disabilities create difficulty in large-scale data collection for ASR system development. Adapting SSL pre-trained ASR models to limited dysarthric speech via data-intensive parameter fine-tuning leads to poor generalization. To this end, this paper presents an extensive comparative study of various data augmentation approaches to improve the robustness of pre-trained ASR model fine-tuning to dysarthric speech. These include: a) conventional speaker-independent perturbation of impaired speech; b) speaker-dependent speed perturbation, or GAN-based adversarial perturbation of normal, control speech based on their time alignment against parallel dysarthric speech; c) novel Spectral basis GAN-based adversarial data augmentation operating on non-parallel data. Experiments conducted on the UASpeech corpus suggest GAN-based data augmentation consistently outperforms fine-tuned Wav2vec2.0 and HuBERT models using no data augmentation and speed perturbation across different data expansion operating points by statistically significant word error rate (WER) reductions up to 2.01% and 0.96% absolute (9.03% and 4.63% relative) respectively on the UASpeech test set of 16 dysarthric speakers. After cross-system outputs rescoring, the best system produced the lowest published WER of 16.53% (46.47% on very low intelligibility) on UASpeech.
△ Less
Submitted 31 December, 2023;
originally announced January 2024.
-
E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models
Authors:
Hongfei Xue,
Yuhao Liang,
Bingshen Mu,
Shiliang Zhang,
Mengzhe Chen,
Qian Chen,
Lei Xie
Abstract:
This study focuses on emotion-sensitive spoken dialogue in human-machine speech interaction. With the advancement of Large Language Models (LLMs), dialogue systems can handle multimodal data, including audio. Recent models have enhanced the understanding of complex audio signals through the integration of various audio events. However, they are unable to generate appropriate responses based on emo…
▽ More
This study focuses on emotion-sensitive spoken dialogue in human-machine speech interaction. With the advancement of Large Language Models (LLMs), dialogue systems can handle multimodal data, including audio. Recent models have enhanced the understanding of complex audio signals through the integration of various audio events. However, they are unable to generate appropriate responses based on emotional speech. To address this, we introduce the Emotional chat Model (E-chat), a novel spoken dialogue system capable of comprehending and responding to emotions conveyed from speech. This model leverages an emotion embedding extracted by a speech encoder, combined with LLMs, enabling it to respond according to different emotional contexts. Additionally, we introduce the E-chat200 dataset, designed explicitly for emotion-sensitive spoken dialogue. In various evaluation metrics, E-chat consistently outperforms baseline LLMs, demonstrating its potential in emotional comprehension and human-machine interaction.
△ Less
Submitted 6 January, 2024; v1 submitted 31 December, 2023;
originally announced January 2024.
-
Reconfigurable Frequency Multipliers Based on Complementary Ferroelectric Transistors
Authors:
Haotian Xu,
Jianyi Yang,
Cheng Zhuo,
Thomas Kämpfe,
Kai Ni,
Xunzhao Yin
Abstract:
Frequency multipliers, a class of essential electronic components, play a pivotal role in contemporary signal processing and communication systems. They serve as crucial building blocks for generating high-frequency signals by multiplying the frequency of an input signal. However, traditional frequency multipliers that rely on nonlinear devices often require energy- and area-consuming filtering an…
▽ More
Frequency multipliers, a class of essential electronic components, play a pivotal role in contemporary signal processing and communication systems. They serve as crucial building blocks for generating high-frequency signals by multiplying the frequency of an input signal. However, traditional frequency multipliers that rely on nonlinear devices often require energy- and area-consuming filtering and amplification circuits, and emerging designs based on an ambipolar ferroelectric transistor require costly non-trivial characteristic tuning or complex technology process. In this paper, we show that a pair of standard ferroelectric field effect transistors (FeFETs) can be used to build compact frequency multipliers without aforementioned technology issues. By leveraging the tunable parabolic shape of the 2FeFET structures' transfer characteristics, we propose four reconfigurable frequency multipliers, which can switch between signal transmission and frequency doubling. Furthermore, based on the 2FeFET structures, we propose four frequency multipliers that realize triple, quadruple frequency modes, elucidating a scalable methodology to generate more multiplication harmonics of the input frequency. Performance metrics such as maximum operating frequency, power, etc., are evaluated and compared with existing works. We also implement a practical case of frequency modulation scheme based on the proposed reconfigurable multipliers without additional devices. Our work provides a novel path of scalable and reconfigurable frequency multiplier designs based on devices that have characteristics similar to FeFETs, and show that FeFETs are a promising candidate for signal processing and communication systems in terms of maximum operating frequency and power.
△ Less
Submitted 28 December, 2023;
originally announced December 2023.
-
A new type of window functions constructed with exponential function
Authors:
Haichao Xu,
Xingpao Suo
Abstract:
The Discrete Fourier Transform (DFT) is widely utilized for signal analysis but is plagued by spectral leakage, leading to inaccuracies in signal approximation. Window functions play a crucial role in mitigating spectral leakage by providing weighting mechanisms for discrete signals. In this paper, we introduce a novel window type based on exponential function, allowing for adjustable parameters a…
▽ More
The Discrete Fourier Transform (DFT) is widely utilized for signal analysis but is plagued by spectral leakage, leading to inaccuracies in signal approximation. Window functions play a crucial role in mitigating spectral leakage by providing weighting mechanisms for discrete signals. In this paper, we introduce a novel window type based on exponential function, allowing for adjustable parameters and diverse variations. We present the formulation, properties, and motivation behind the design of the new window functions. Additionally, we analyze their behavior and evaluate their performance by comparing them with mainstream window functions using six parameters. Our findings demonstrate that these new window functions exhibit outstanding flexibility and versatility in signal analysis.
△ Less
Submitted 23 December, 2023;
originally announced December 2023.
-
Rotational Augmented Noise2Inverse for Low-dose Computed Tomography Reconstruction
Authors:
Hang Xu,
Alessandro Perelli
Abstract:
In this work, we present a novel self-supervised method for Low Dose Computed Tomography (LDCT) reconstruction. Reducing the radiation dose to patients during a CT scan is a crucial challenge since the quality of the reconstruction highly degrades because of low photons or limited measurements. Supervised deep learning methods have shown the ability to remove noise in images but require accurate g…
▽ More
In this work, we present a novel self-supervised method for Low Dose Computed Tomography (LDCT) reconstruction. Reducing the radiation dose to patients during a CT scan is a crucial challenge since the quality of the reconstruction highly degrades because of low photons or limited measurements. Supervised deep learning methods have shown the ability to remove noise in images but require accurate ground truth which can be obtained only by performing additional high-radiation CT scans. Therefore, we propose a novel self-supervised framework for LDCT, in which ground truth is not required for training the convolutional neural network (CNN). Based on the Noise2Inverse (N2I) method, we enforce in the training loss the equivariant property of rotation transformation, which is induced by the CT imaging system, to improve the quality of the CT image in a lower dose. Numerical and experimental results show that the reconstruction accuracy of N2I with sparse views is degrading while the proposed rotational augmented Noise2Inverse (RAN2I) method keeps better image quality over a different range of sampling angles. Finally, the quantitative results demonstrate that RAN2I achieves higher image quality compared to N2I, and experimental results of RAN2I on real projection data show comparable performance to supervised learning.
△ Less
Submitted 19 December, 2023;
originally announced December 2023.
-
Fluid Antenna-Assisted MIMO Transmission Exploiting Statistical CSI
Authors:
Yuqi Ye,
Li You,
Jue Wang,
Hao Xu,
Kai-Kit Wong,
Xiqi Gao
Abstract:
In conventional multiple-input multiple-output (MIMO) communication systems, the positions of antennas are fixed. To take full advantage of spatial degrees of freedom, a new technology called fluid antenna (FA) is proposed to obtain higher achievable rate and diversity gain. Most existing works on FA exploit instantaneous channel state information (CSI). However, in FA-assisted systems, it is diff…
▽ More
In conventional multiple-input multiple-output (MIMO) communication systems, the positions of antennas are fixed. To take full advantage of spatial degrees of freedom, a new technology called fluid antenna (FA) is proposed to obtain higher achievable rate and diversity gain. Most existing works on FA exploit instantaneous channel state information (CSI). However, in FA-assisted systems, it is difficult to obtain instantaneous CSI since changes in the antenna position will lead to channel variation. In this letter, we investigate a FA-assisted MIMO system using relatively slow-varying statistical CSI. Specifically, in the criterion of rate maximization, we propose an algorithmic framework for transmit precoding and transmit/receive FAs position designs with statistical CSI. Simulation results show that our proposed algorithm in FA-assisted systems significantly outperforms baselines in terms of rate performance.
△ Less
Submitted 12 December, 2023;
originally announced December 2023.
-
A Scalable Network-Aware Multi-Agent Reinforcement Learning Framework for Decentralized Inverter-based Voltage Control
Authors:
Han Xu,
Jialin Zheng,
Guannan Qu
Abstract:
This paper addresses the challenges associated with decentralized voltage control in power grids due to an increase in distributed generations (DGs). Traditional model-based voltage control methods struggle with the rapid energy fluctuations and uncertainties of these DGs. While multi-agent reinforcement learning (MARL) has shown potential for decentralized secondary control, scalability issues ar…
▽ More
This paper addresses the challenges associated with decentralized voltage control in power grids due to an increase in distributed generations (DGs). Traditional model-based voltage control methods struggle with the rapid energy fluctuations and uncertainties of these DGs. While multi-agent reinforcement learning (MARL) has shown potential for decentralized secondary control, scalability issues arise when dealing with a large number of DGs. This problem lies in the dominant centralized training and decentralized execution (CTDE) framework, where the critics take global observations and actions. To overcome these challenges, we propose a scalable network-aware (SNA) framework that leverages network structure to truncate the input to the critic's Q-function, thereby improving scalability and reducing communication costs during training. Further, the SNA framework is theoretically grounded with provable approximation guarantee, and it can seamlessly integrate with multiple multi-agent actor-critic algorithms. The proposed SNA framework is successfully demonstrated in a system with 114 DGs, providing a promising solution for decentralized voltage control in increasingly complex power grid systems.
△ Less
Submitted 7 December, 2023;
originally announced December 2023.
-
Channel Estimation for FAS-assisted Multiuser mmWave Systems
Authors:
Hao Xu,
Gui Zhou,
Kai-Kit Wong,
Wee Kiat New,
Chao Wang,
Chan-Byoung Chae,
Ross Murch,
Shi **,
Yangyang Zhang
Abstract:
This letter investigates the challenge of channel estimation in a multiuser millimeter-wave (mmWave) time-division duplexing (TDD) system. In this system, the base station (BS) employs a multi-antenna uniform linear array (ULA), while each mobile user is equipped with a fluid antenna system (FAS). Accurate channel state information (CSI) plays a crucial role in the precise placement of antennas in…
▽ More
This letter investigates the challenge of channel estimation in a multiuser millimeter-wave (mmWave) time-division duplexing (TDD) system. In this system, the base station (BS) employs a multi-antenna uniform linear array (ULA), while each mobile user is equipped with a fluid antenna system (FAS). Accurate channel state information (CSI) plays a crucial role in the precise placement of antennas in FAS. Traditional channel estimation methods designed for fixed-antenna systems are inadequate due to the high dimensionality of FAS. To address this issue, we propose a low-sample-size sparse channel reconstruction (L3SCR) method, capitalizing on the sparse propagation paths characteristic of mmWave channels. In this approach, each fluid antenna only needs to switch and measure the channel at a few specific locations. By observing this reduced-dimensional data, we can effectively extract angular and gain information related to the sparse channel, enabling us to reconstruct the full CSI. Simulation results demonstrate that our proposed method allows us to obtain precise CSI with minimal hardware switching and pilot overhead. As a result, the system sum-rate approaches the upper bound achievable with perfect CSI.
△ Less
Submitted 3 January, 2024; v1 submitted 18 November, 2023;
originally announced November 2023.
-
An Event-Based Synchronization Framework for Controller Hardware-in-the-loop Simulation of Electric Railway Power Electronics Systems
Authors:
Jialin Zheng,
Yangbin Zeng,
Han Xu,
Weicheng Liu,
Di Mou,
Zhengming Zhao
Abstract:
The Controller Hardware_in_the_loop (CHIL) simulation is gaining popularity as a cost_effective, efficient, and reliable tool in the design and development process of fast_growing electrified transportation power converters. However, it is challenging to implement the conventional CHIL simulations on the railway power converters with complex topologies and high switching frequencies due to strict…
▽ More
The Controller Hardware_in_the_loop (CHIL) simulation is gaining popularity as a cost_effective, efficient, and reliable tool in the design and development process of fast_growing electrified transportation power converters. However, it is challenging to implement the conventional CHIL simulations on the railway power converters with complex topologies and high switching frequencies due to strict real_time constraints. Therefore, this paper proposes an event-based synchronization CHIL (ES_CHIL) framework for high_fidelity simulation of these electrified railway power converters. Different from conventional CHIL simulations synchronized through the time axis, the ES_CHIL framework is synchronized through the event axis. Therefore, it can ease the real_time constraint and broaden the upper bound on the system size and switching frequency. Besides, models and algorithms with higher accuracy, such as the diode model with natural commutation processes, can be used in the ES-CHIL framework. The proposed framework is validated for a 350 kW wireless power transformer system containing 24 fully controlled devices and 36 diodes by comparing it with Simulink and physical experiments. This research improves the fidelity and application range of the power converters CHIL simulation. Thus, it helps to accelerate the prototype design and performance evaluation process for electrified railways and other applications with such complex converters.
△ Less
Submitted 12 November, 2023;
originally announced November 2023.
-
Accurate Time-segmented Loss Model for SiC MOSFETs in Electro-thermal Multi-Rate Simulation
Authors:
Jialin Zheng,
Zhengming Zhao,
Han Xu,
Weicheng Liu,
Yangbin Zeng
Abstract:
Compared with silicon (Si) power devices, Silicon carbide (SiC) devices have the advantages of fast switching speed and low on-resistance. However, the effects of non-ideal characteristics of SiC MOSFETs and stray parameters (especially parasitic inductance) on switching losses need to be further evaluated. In this paper, a transient loss model based on SiC MOSFET and SiC Schottky barrier diode (S…
▽ More
Compared with silicon (Si) power devices, Silicon carbide (SiC) devices have the advantages of fast switching speed and low on-resistance. However, the effects of non-ideal characteristics of SiC MOSFETs and stray parameters (especially parasitic inductance) on switching losses need to be further evaluated. In this paper, a transient loss model based on SiC MOSFET and SiC Schottky barrier diode (SBD) switching pairs is proposed. The transient process analysis is simplified by time segmentation of the transient process of power switching devices. The electro-thermal simulation calculates the junction temperature and updates the temperature-related parameters with the proposed loss model and the thermal network model. A multi-rate data exchange strategy is proposed to solve the problem of disparity in timescales between circuit simulation and thermal network simulation. The CREE CMF20120D SiC MOSFET device is used for the experimental verification. The experimental results verify the accuracy of the model which provides guidance for the circuit design of SiC MOSFETs. All the parameters of the loss model can be extracted from the datasheet, which is practical in power electronics design.
△ Less
Submitted 12 November, 2023;
originally announced November 2023.
-
Sum-Rate Optimization for RIS-Aided Multiuser Communications with Movable Antenna
Authors:
Yunan Sun,
Hao Xu,
Chongjun Ouyang,
Hongwen Yang
Abstract:
Reconfigurable intelligent surface (RIS) is known as a promising technology to improve the performance of wireless communication networks, which has been extensively studied. Movable antenna (MA) is a novel technology that fully exploits the antenna position for enhancing the channel capacity. In this paper, we propose a new RIS-aided multiuser communication system with MAs. The sum-rate is maximi…
▽ More
Reconfigurable intelligent surface (RIS) is known as a promising technology to improve the performance of wireless communication networks, which has been extensively studied. Movable antenna (MA) is a novel technology that fully exploits the antenna position for enhancing the channel capacity. In this paper, we propose a new RIS-aided multiuser communication system with MAs. The sum-rate is maximized by jointly optimizing the beamforming, the reflection coefficient (RC) values of RIS and the positions of MAs. A fractional programming-based iterative algorithm is proposed to solve the formulated non-convex problem, considering three assumptions for the RIS. Numerical results are presented to verify the effectiveness of the proposed algorithm and the superiority of the proposed MA-based system in terms of sum-rate.
△ Less
Submitted 11 November, 2023;
originally announced November 2023.
-
FPGA-Based Implicit-Explicit Real-time Simulation Solver for Railway Wireless Power Transfer with Nonlinear Magnetic Coupling Components
Authors:
Han Xu,
Yangbin Zeng,
Jialin Zheng,
Kainan Chen,
Weicheng Liu,
Zhengming Zhao
Abstract:
Railway Wireless Power Transfer (WPT) is a promising non-contact power supply solution, but constructing prototypes for controller testing can be both costly and unsafe. Real-time hardware-in-the-loop simulation is an effective and secure testing tool, but simulating the dynamic charging process of railway WPT systems is challenging due to the continuous changes in the nonlinear magnetic coupling…
▽ More
Railway Wireless Power Transfer (WPT) is a promising non-contact power supply solution, but constructing prototypes for controller testing can be both costly and unsafe. Real-time hardware-in-the-loop simulation is an effective and secure testing tool, but simulating the dynamic charging process of railway WPT systems is challenging due to the continuous changes in the nonlinear magnetic coupling components. To address this challenge, we propose an FPGA-based half-step implicit-explicit (IMEX) simulation solver. The proposed solver adopts an IMEX algorithm to solve the piecewise linear and nonlinear parts of the system separately, which enables FPGAs to solve nonlinear components while achieving high numerical stability. Additionally, we divide a complete integration step into two half-steps to reduce computational time delays. Our proposed method offers a promising solution for the real-time simulation of railway WPT systems. The novelty of our approach lies in the use of the IMEX algorithm and the half-step integration method, which significantly improves the accuracy and efficiency of the simulation. Our simulations and experiments demonstrate the effectiveness and accuracy of the proposed solver, which provides a new approach for simulating and optimizing railway WPT systems with nonlinear magnetic coupling components.
△ Less
Submitted 27 October, 2023;
originally announced October 2023.
-
Numerical Derivative-based Flexible Integration Algorithm for Power Electronic Systems Simulation Considering Nonlinear Components
Authors:
Han Xu,
Bochen Shi,
Zhujun Yu,
Jialin Zheng,
Zhengming Zhao
Abstract:
Simulation is an efficient tool in the design and control of power electronic systems. However, quick and accurate simulation of them is still challenging, especially when the system contains a large number of switches and state variables. Conventional general-purpose integration algorithms assume nonlinearity within systems but face inefficiency in handling the piecewise characteristics of power…
▽ More
Simulation is an efficient tool in the design and control of power electronic systems. However, quick and accurate simulation of them is still challenging, especially when the system contains a large number of switches and state variables. Conventional general-purpose integration algorithms assume nonlinearity within systems but face inefficiency in handling the piecewise characteristics of power electronic switches. While some specialized algorithms can adapt to the piecewise characteristics, most of these methods require systems to be piecewise linear. In this article, a numerical derivative-based flexible integration algorithm is proposed. This algorithm can adapt to the piecewise characteristic caused by switches and have no difficulty when nonlinear non-switching components are present in the circuit. This algorithm consists of a recursive numerical scheme that obtains high-order time derivatives of nonlinear components and a decoupling strategy that further increases computational efficiency. The proposed method is applied to solve a motor derive system and a large-scale power conversion system (PCS) to verify its accuracy and efficiency by comparing experimental waveforms and simulated results given by commercial software. Our proposed method demonstrates several-fold acceleration compared to multiple commonly used algorithms in Simulink.
△ Less
Submitted 24 October, 2023;
originally announced October 2023.
-
Low-latency Speech Enhancement via Speech Token Generation
Authors:
Huaying Xue,
Xiulian Peng,
Yan Lu
Abstract:
Existing deep learning based speech enhancement mainly employ a data-driven approach, which leverage large amounts of data with a variety of noise types to achieve noise removal from noisy signal. However, the high dependence on the data limits its generalization on the unseen complex noises in real-life environment. In this paper, we focus on the low-latency scenario and regard speech enhancement…
▽ More
Existing deep learning based speech enhancement mainly employ a data-driven approach, which leverage large amounts of data with a variety of noise types to achieve noise removal from noisy signal. However, the high dependence on the data limits its generalization on the unseen complex noises in real-life environment. In this paper, we focus on the low-latency scenario and regard speech enhancement as a speech generation problem conditioned on the noisy signal, where we generate clean speech instead of identifying and removing noises. Specifically, we propose a conditional generative framework for speech enhancement, which models clean speech by acoustic codes of a neural speech codec and generates the speech codes conditioned on past noisy frames in an auto-regressive way. Moreover, we propose an explicit-alignment approach to align noisy frames with the generated speech tokens to improve the robustness and scalability to different input lengths. Different from other methods that leverage multiple stages to generate speech codes, we leverage a single-stage speech generation approach based on the TF-Codec neural codec to achieve high speech quality with low latency. Extensive results on both synthetic and real-recorded test set show its superiority over data-driven approaches in terms of noise robustness and temporal speech coherence.
△ Less
Submitted 23 January, 2024; v1 submitted 13 October, 2023;
originally announced October 2023.
-
BA-MoE: Boundary-Aware Mixture-of-Experts Adapter for Code-Switching Speech Recognition
Authors:
Peikun Chen,
Fan Yu,
Yuhao Lian,
Hongfei Xue,
Xucheng Wan,
Naijun Zheng,
Huan Zhou,
Lei Xie
Abstract:
Mixture-of-experts based models, which use language experts to extract language-specific representations effectively, have been well applied in code-switching automatic speech recognition. However, there is still substantial space to improve as similar pronunciation across languages may result in ineffective multi-language modeling and inaccurate language boundary estimation. To eliminate these dr…
▽ More
Mixture-of-experts based models, which use language experts to extract language-specific representations effectively, have been well applied in code-switching automatic speech recognition. However, there is still substantial space to improve as similar pronunciation across languages may result in ineffective multi-language modeling and inaccurate language boundary estimation. To eliminate these drawbacks, we propose a cross-layer language adapter and a boundary-aware training method, namely Boundary-Aware Mixture-of-Experts (BA-MoE). Specifically, we introduce language-specific adapters to separate language-specific representations and a unified gating layer to fuse representations within each encoder layer. Second, we compute language adaptation loss of the mean output of each language-specific adapter to improve the adapter module's language-specific representation learning. Besides, we utilize a boundary-aware predictor to learn boundary representations for dealing with language boundary confusion. Our approach achieves significant performance improvement, reducing the mixture error rate by 16.55\% compared to the baseline on the ASRU 2019 Mandarin-English code-switching challenge dataset.
△ Less
Submitted 7 October, 2023; v1 submitted 4 October, 2023;
originally announced October 2023.
-
SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition
Authors:
Hongfei Xue,
Qijie Shao,
Kaixun Huang,
Peikun Chen,
Jie Liu,
Lei Xie
Abstract:
Multilingual automatic speech recognition (ASR) systems have garnered attention for their potential to extend language coverage globally. While self-supervised learning (SSL) models, like MMS, have demonstrated their effectiveness in multilingual ASR, it is worth noting that various layers' representations potentially contain distinct information that has not been fully leveraged. In this study, w…
▽ More
Multilingual automatic speech recognition (ASR) systems have garnered attention for their potential to extend language coverage globally. While self-supervised learning (SSL) models, like MMS, have demonstrated their effectiveness in multilingual ASR, it is worth noting that various layers' representations potentially contain distinct information that has not been fully leveraged. In this study, we propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune the MMS model. We first analyze the different layers of MMS and show that the middle layers capture language-related information, and the high layers encode content-related information, which gradually decreases in the final layers. Then, we extract a language-related frame from correlated middle layers and guide specific language extraction through self-attention mechanisms. Additionally, we steer the model toward acquiring more content-related information in the final layers using our proposed Cross-CTC. We evaluate SSHR on two multilingual datasets, Common Voice and ML-SUPERB, and the experimental results demonstrate that our method achieves state-of-the-art performance.
△ Less
Submitted 27 April, 2024; v1 submitted 28 September, 2023;
originally announced September 2023.
-
Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition
Authors:
Dongji Gao,
Hainan Xu,
Desh Raj,
Leibny Paola Garcia Perera,
Daniel Povey,
Sanjeev Khudanpur
Abstract:
Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. In this paper, we propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. Thi…
▽ More
Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. In this paper, we propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. This allows the model to effectively learn speech-text alignments while accommodating errors present in the training transcripts. OTC extends the conventional CTC objective for imperfect transcripts by leveraging weighted finite state transducers. Through experiments conducted on the LibriSpeech and LibriVox datasets, we demonstrate that training ASR models with OTC avoids performance degradation even with transcripts containing up to 70% errors, a scenario where CTC models fail completely. Our implementation is available at https://github.com/k2-fsa/icefall.
△ Less
Submitted 26 September, 2023;
originally announced September 2023.
-
Learning Model Predictive Control with Error Dynamics Regression for Autonomous Racing
Authors:
Haoru Xue,
Edward L. Zhu,
John M. Dolan,
Francesco Borrelli
Abstract:
This work presents a novel Learning Model Predictive Control (LMPC) strategy for autonomous racing at the handling limit that can iteratively explore and learn unknown dynamics in high-speed operational domains. We start from existing LMPC formulations and modify the system dynamics learning method. In particular, our approach uses a nominal, global, nonlinear, physics-based model with a local, li…
▽ More
This work presents a novel Learning Model Predictive Control (LMPC) strategy for autonomous racing at the handling limit that can iteratively explore and learn unknown dynamics in high-speed operational domains. We start from existing LMPC formulations and modify the system dynamics learning method. In particular, our approach uses a nominal, global, nonlinear, physics-based model with a local, linear, data-driven learning of the error dynamics. We conducted experiments in simulation and on 1/10th scale hardware, and deployed the proposed LMPC on a full-scale autonomous race car used in the Indy Autonomous Challenge (IAC) with closed loop experiments at the Putnam Park Road Course in Indiana, USA. The results show that the proposed control policy exhibits improved robustness to parameter tuning and data scarcity. Incremental and safety-aware exploration toward the limit of handling and iterative learning of the vehicle dynamics in high-speed domains is observed both in simulations and experiments.
△ Less
Submitted 7 March, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.