
AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection

Authors: Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, Pingyi Fan

Abstract: Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machine anomalous sound detection (ASD) task. This may be caused by the inconsistency of the pre-trained model and the inductive bias of machine audio, resulting in inconsistency in data and architecture. Thus, we propose AnoPatch which utilizes a ViT backbone pre-trained on AudioSet and fine-tunes it on machine audio. It is believed that machine audio is more related to audio datasets than speech datasets, and modeling it from patch level suits the sparsity of machine audio. As a result, AnoPatch showcases state-of-the-art (SOTA) performances on the DCASE 2020 ASD dataset and the DCASE 2023 ASD dataset. We also compare multiple pre-trained models and empirically demonstrate that better consistency yields considerable improvement.

Submitted 17 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.03714 [pdf, other]

Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining

Authors: Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li

Abstract: Recent prompt-based text-to-speech (TTS) models can clone an unseen speaker using only a short speech prompt. They leverage a strong in-context ability to mimic the speech prompts, including speaker style, prosody, and emotion. Therefore, the selection of a speech prompt greatly influences the generated speech, akin to the importance of a prompt in large language models (LLMs). However, current prompt-based TTS models choose the speech prompt manually or simply at random. Hence, in this paper, we adapt retrieval augmented generation (RAG) from LLMs to prompt-based TTS. Unlike traditional RAG methods, we additionally consider contextual information during the retrieval process and present a Context-Aware Contrastive Language-Audio Pre-training (CA-CLAP) model to extract context-aware, style-related features. The objective and subjective evaluations demonstrate that our proposed RAG method outperforms baselines, and our CA-CLAP achieves better results than text-only retrieval methods.

Submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.03706 [pdf, other]

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

Authors: Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li

Abstract: Recent advances in large language models (LLMs) and development of audio codecs greatly propel the zero-shot TTS. They can synthesize personalized speech with only a 3-second speech of an unseen speaker as acoustic prompt. However, they only support short speech prompts and cannot leverage longer context information, as required in audiobook and conversational TTS scenarios. In this paper, we introduce a novel audio codec-based TTS model to adapt context features with multiple enhancements. Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer (MMCE-Qformer) to utilize additional multi-modal context information. Besides, we adapt a pretrained LLM to leverage its understanding ability to predict semantic tokens, and use a SoundStorm to generate acoustic tokens thereby enhancing audio quality and speaker similarity. The extensive objective and subjective evaluations show that our proposed method outperforms baselines across various context TTS scenarios.

Submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2405.00603 [pdf, other]

Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation

Authors: Yimin Deng, Jianzong Wang, Xulong Zhang, Ning Cheng, Jing Xiao

Abstract: Voice conversion is the task to transform voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized in content extraction. However, in these representations, a lot of hidden speaker information leads to timbre leakage while the prosodic information of hidden units lacks use. To address these issues, we propose a novel framework for expressive voice conversion called "SAVC" based on soft speech units from HuBert-soft. Taking soft speech units as input, we design an attribute encoder to extract content and prosody features respectively. Specifically, we first introduce statistic perturbation imposed by adversarial style augmentation to eliminate speaker information. Then the prosody is implicitly modeled on soft speech units with knowledge distillation. Experiment results show that the intelligibility and naturalness of converted speech outperform previous work.

Submitted 1 May, 2024; originally announced May 2024.

Comments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

arXiv:2404.01654 [pdf, other]

AI WALKUP: A Computer-Vision Approach to Quantifying MDS-UPDRS in Parkinson's Disease

Authors: Xiang Xiang, Zihan Zhang, Jing Ma, Yao Deng

Abstract: Parkinson's Disease (PD) is the second most common neurodegenerative disorder. The existing assessment method for PD is usually the Movement Disorder Society - Unified Parkinson's Disease Rating Scale (MDS-UPDRS) to assess the severity of various types of motor symptoms and disease progression. However, manual assessment suffers from high subjectivity, lack of consistency, and high cost and low efficiency of manual communication. We want to use a computer vision based solution to capture human pose images based on a camera, reconstruct and perform motion analysis using algorithms, and extract the features of the amount of motion through feature engineering. The proposed approach can be deployed on different smartphones, and the video recording and artificial intelligence analysis can be done quickly and easily through our APP.

Submitted 2 April, 2024; originally announced April 2024.

Comments: Technical report for AI WALKUP, an APP winning 3rd Prize of 2022 HUST GS AI Innovation and Design Competition

arXiv:2403.17392 [pdf, other]

Natural-artificial hybrid swarm: Cyborg-insect group navigation in unknown obstructed soft terrain

Authors: Yang Bai, Phuoc Thanh Tran Ngoc, Huu Duoc Nguyen, Duc Long Le, Quang Huy Ha, Kazuki Kai, Yu Xiang See To, Yaosheng Deng, Jie Song, Naoki Wakamiya, Hirotaka Sato, Masaki Ogura

Abstract: Navigating multi-robot systems in complex terrains has always been a challenging task. This is due to the inherent limitations of traditional robots in collision avoidance, adaptation to unknown environments, and sustained energy efficiency. In order to overcome these limitations, this research proposes a solution by integrating living insects with miniature electronic controllers to enable robotic-like programmable control, and proposing a novel control algorithm for swarming. Although these creatures, called cyborg insects, have the ability to instinctively avoid collisions with neighbors and obstacles while adapting to complex terrains, there is a lack of literature on the control of multi-cyborg systems. This research gap is due to the difficulty in coordinating the movements of a cyborg system under the presence of insects' inherent individual variability in their reactions to control input. In response to this issue, we propose a novel swarm navigation algorithm addressing these challenges. The effectiveness of the algorithm is demonstrated through an experimental validation in which a cyborg swarm was successfully navigated through an unknown sandy field with obstacles and hills. This research contributes to the domain of swarm robotics and showcases the potential of integrating biological organisms with robotics and control theory to create more intelligent autonomous systems with real-world applications.

Submitted 27 March, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

arXiv:2402.16027 [pdf, other]

Enhancing xURLLC with RSMA-Assisted Massive-MIMO Networks: Performance Analysis and Optimization

Authors: Yuang Chen, Hancheng Lu, Chenwu Zhang, Yansha Deng, Arumugam Nallanathan

Abstract: Massive interconnection has sparked people's envisioning for next-generation ultra-reliable and low-latency communications (xURLLC), prompting the design of customized next-generation advanced transceivers (NGAT). Rate-splitting multiple access (RSMA) has emerged as a pivotal technology for NGAT design, given its robustness to imperfect channel state information (CSI) and resilience to quality of service (QoS). Additionally, xURLLC urgently appeals to large-scale access techniques, thus massive multiple-input multiple-output (mMIMO) is anticipated to integrate with RSMA to enhance xURLLC. In this paper, we develop an innovative RSMA-assisted massive-MIMO xURLLC (RSMA-mMIMO-xURLLC) network architecture tailored to accommodate xURLLC's critical QoS constraints in finite blocklength (FBL) regimes. Leveraging uplink pilot training under imperfect CSI at the transmitter, we estimate channel gains and customize linear precoders for efficient downlink short-packet data transmission. Subsequently, we formulate a joint rate-splitting, beamforming, and transmit antenna selection optimization problem to maximize the total effective transmission rate (ETR). Addressing this multi-variable coupled non-convex problem, we decompose it into three corresponding subproblems and propose a low-complexity joint iterative algorithm for efficient optimization. Extensive simulations substantiate that compared with non-orthogonal multiple access (NOMA) and space division multiple access (SDMA), the developed architecture improves the total ETR by 15.3% and 41.91%, respectively, as well as accommodates larger-scale access.

Submitted 25 February, 2024; originally announced February 2024.

Comments: 14 pages, 11 figures, Submitted to IEEE for potential publication

arXiv:2402.11478 [pdf, other]

Federated Reinforcement Learning for Uplink Centric Broadband Communication Optimization over Unlicensed Spectrum

Authors: Hui Zhou, Yansha Deng

Abstract: To provide Uplink Centric Broadband Communication (UCBC), New Radio Unlicensed (NR-U) network has been standardized to exploit the unlicensed spectrum using Listen Before Talk (LBT) scheme to fairly coexist with the incumbent Wireless Fidelity (WiFi) network. Existing access schemes over unlicensed spectrum are required to perform Clear Channel Assessment (CCA) before transmissions, where fixed Energy Detection (ED) thresholds are adopted to identify the channel as idle or busy. However, fixed ED thresholds setting prevents devices from accessing the channel effectively and efficiently, which leads to the hidden node (HN) and exposed node (EN) problems. In this paper, we first develop a centralized double Deep Q-Network (DDQN) algorithm to optimize the uplink system throughput, where the agent is deployed at the central server to dynamically adjust the ED thresholds for NR-U and WiFi networks. Considering that heterogeneous NR-U and WiFi networks, in practice, cannot share the raw data with the central server directly, we then develop a federated DDQN algorithm, where two agents are deployed in the NR-U and WiFi networks, respectively. Our results have shown that the uplink system throughput increases by over 100%, where cell throughput of NR-U network rises by 150%, and cell throughput of WiFi network decreases by 30%. To guarantee the cell throughput of WiFi network, we redesign the reward function to punish the agent when the cell throughput of WiFi network is below the threshold, and our revised design can still provide over 50% uplink system throughput gain.

Submitted 18 February, 2024; originally announced February 2024.

arXiv:2401.08096 [pdf, other]

Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval

Authors: Yimin Deng, Huaizhen Tang, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang

Abstract: Voice conversion refers to transferring speaker identity with well-preserved content. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information from input audio has the potential to represent content well. In addition, speaker-style modeling with pre-trained models makes the process more complex. To tackle these issues, we introduce a new method named "CTVC" which utilizes disentangled speech representations with contrastive learning and time-invariant retrieval. Specifically, a similarity-based compression module is used to facilitate a closer connection between the frame-level hidden features and linguistic information at the phoneme level. Additionally, a time-invariant retrieval is proposed for timbre extraction based on multiple segmentations and mutual information. Experimental results demonstrate that "CTVC" outperforms previous studies and improves the sound quality and similarity of converted results.

Submitted 17 January, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

Comments: Accepted by 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2024)

arXiv:2401.01544 [pdf, other]

Collaborative Perception for Connected and Autonomous Driving: Challenges, Possible Solutions and Opportunities

Authors: Senkang Hu, Zhengru Fang, Yiqin Deng, Xianhao Chen, Yuguang Fang

Abstract: Autonomous driving has attracted significant attention from both academia and industry, and is expected to offer a safer and more efficient driving system. However, current autonomous driving systems are mostly based on a single vehicle, which has significant limitations that still pose threats to driving safety. Collaborative perception with connected and autonomous vehicles (CAVs) offers a promising solution to overcoming these limitations. In this article, we first identify the challenges of collaborative perception, such as data sharing asynchrony, data volume, and pose errors. Then, we discuss the possible solutions to address these challenges with various technologies, where the research opportunities are also elaborated. Furthermore, we propose a scheme to deal with communication efficiency and latency problems, which is a channel-aware collaborative perception framework to dynamically adjust the communication graph and minimize latency, thereby improving perception performance while increasing communication efficiency. Finally, we conduct experiments to demonstrate the effectiveness of our proposed scheme.

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2401.01044 [pdf, other]

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

Authors: Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li

Abstract: Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system adapting T2I model frameworks to the TTA task, by effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches using limited data and computational resources. Furthermore, previous studies in T2I recognize the significant impact of encoder choice on cross-modal alignment, such as fine-grained details and object bindings, while similar evaluation is lacking in prior TTA works. Through comprehensive ablation studies and innovative cross-attention map visualizations, we provide insightful assessments of text-audio alignment in TTA. Our findings reveal Auffusion's superior capability in generating audio that accurately matches textual descriptions, which is further demonstrated in several related tasks, such as audio style transfer, inpainting and other manipulations. Our implementation and demos are available at https://auffusion.github.io.

Submitted 2 January, 2024; originally announced January 2024.

Comments: Demo and implementation at https://auffusion.github.io

arXiv:2312.16383 [pdf, ps, other]

Frame-level emotional state alignment method for speech emotion recognition

Authors: Qifei Li, Yingming Gao, Cong Wang, Yayue Deng, Jinlong Xue, Yichen Han, Ya Li

Abstract: Speech emotion recognition (SER) systems aim to recognize human emotional state during human-computer interaction. Most existing SER systems are trained based on utterance-level labels. However, not all frames in an audio clip have affective states consistent with the utterance-level label, which makes it difficult for the model to distinguish the true emotion of the audio and causes it to perform poorly. To address this problem, we propose a frame-level emotional state alignment method for SER. First, we fine-tune a HuBERT model to obtain a SER system with the task-adaptive pretraining (TAPT) method, and extract embeddings from its transformer layers to form frame-level pseudo-emotion labels with clustering. Then, the pseudo labels are used to pretrain HuBERT. Hence, each frame output of HuBERT has corresponding emotional information. Finally, we fine-tune the above pretrained HuBERT for SER by adding an attention layer on top of it, which can focus only on those frames that are emotionally more consistent with the utterance-level label. The experimental results on IEMOCAP indicate that our proposed method performs better than state-of-the-art (SOTA) methods.

Submitted 26 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP 2024

arXiv:2312.13182 [pdf, other]

Task-oriented Semantics-aware Communications for Robotic Waypoint Transmission: the Value and Age of Information Approach

Authors: Wenchao Wu, Yuanqing Yang, Yansha Deng, A. Hamid Aghvami

Abstract: The ultra-reliable and low-latency communication (URLLC) service of the fifth-generation (5G) mobile communication network struggles to support safe robot operation. Nowadays, the sixth-generation (6G) mobile communication network is proposed to provide hyper-reliable and low-latency communication to enable safer control for robots. However, current 5G/6G research mainly focused on improving communication performance, while the robotics community mostly assumed communication to be ideal. To jointly consider communication and robotic control with a focus on the specific robotic task, we propose task-oriented and semantics-aware communication in robotic control (TSRC) to exploit the context of data and its importance in achieving the task at both transmitter and receiver. At the transmitter, we propose a deep reinforcement learning algorithm to generate optimal control and command (C&C) data and a proactive repetition scheme (DeepPro) to increase the successful transmission probability. At the receiver, we design the value of information (VoI) and age of information (AoI) based queue ordering mechanism (VA-QOM) to reorganize the queue based on the semantic information extracted from the AoI and the VoI. The simulation results validate that our proposed TSRC framework achieves a 91.5% improvement in the mean square error compared to the traditional unmanned aerial vehicle control framework.

Submitted 20 December, 2023; originally announced December 2023.

arXiv:2312.12358 [pdf, other]

Localization and Discrete Beamforming with a Large Reconfigurable Intelligent Surface

Authors: Baojia Luo, Yili Deng, Miaomiao Dong, Zhongyi Huang, Xiang Chen, Wei Han, Bo Bai

Abstract: In millimeter-wave (mmWave) cellular systems, reconfigurable intelligent surfaces (RISs) are foreseeably deployed with a large number of reflecting elements to achieve high beamforming gains. The large-sized RIS will make radio links fall in the near-field localization regime with spatial non-stationarity issues. Moreover, the discrete phase restriction on the RIS reflection coefficient incurs exponential complexity for discrete beamforming. It remains an open problem to find the optimal RIS reflection coefficient design in polynomial time. To address these issues, we propose a scalable partitioned-far-field protocol that considers both the near-field non-stationarity and discrete beamforming. The protocol approximates near-field signal propagation using a partitioned-far-field representation to inherit the sparsity from the sophisticated far-field and facilitate the near-field localization scheme. To improve the theoretical localization performance, we propose a fast passive beamforming (FPB) algorithm that optimally solves the discrete RIS beamforming problem, reducing the search complexity from exponential order to linear order. Furthermore, by exploiting the partitioned structure of RIS, we introduce a two-stage coarse-to-fine localization algorithm that leverages both the time delay and angle information. Numerical results demonstrate that centimeter-level localization precision is achieved under medium and high signal-to-noise ratios (SNR), revealing that RISs can provide support for low-cost and high-precision localization in future cellular systems.

Submitted 19 December, 2023; originally announced December 2023.

Comments: 13 pages

arXiv:2311.08670 [pdf, other]

CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation

Authors: Yimin Deng, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Abstract: Better disentanglement of speech representation is essential to improve the quality of voice conversion. Recently, contrastive learning has been applied to voice conversion successfully based on speaker labels. However, the model's performance degrades in conversion between similar speakers. Hence, we propose an augmented negative sample selection to address the issue. Specifically, we create hard negative samples based on the proposed speaker fusion module to improve the learning ability of the speaker encoder. Furthermore, considering the fine-grained modeling of speaker style, we employ a reference encoder to extract fine-grained style and conduct the augmented contrastive learning on global style. The experimental results show that the proposed method outperforms previous work in voice conversion tasks.

Submitted 14 November, 2023; originally announced November 2023.

Comments: Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)

arXiv:2310.07062 [pdf, other]

Acoustic Model Fusion for End-to-end Speech Recognition

Authors: Zhihong Lei, Mingbin Xu, Shiyi Han, Leo Liu, Zhen Huang, Tim Ng, Yuanyuan Zhang, Ernest Pusateri, Mirko Hannemann, Yaqiao Deng, Man-Hung Siu

Abstract: Recent advances in deep learning and automatic speech recognition (ASR) have enabled the end-to-end (E2E) ASR system and boosted the accuracy to a new level. The E2E systems implicitly model all conventional ASR components, such as the acoustic model (AM) and the language model (LM), in a single network trained on audio-text pairs. Despite this simpler system architecture, fusing a separate LM, trained exclusively on text corpora, into the E2E system has proven to be beneficial. However, the application of LM fusion presents certain drawbacks, such as its inability to address the domain mismatch issue inherent to the internal AM. Drawing inspiration from the concept of LM fusion, we propose the integration of an external AM into the E2E system to better address the domain mismatch. By implementing this novel approach, we have achieved a significant reduction in the word error rate, with an impressive drop of up to 14.3% across varied test sets. We also discovered that this AM fusion approach is particularly beneficial in enhancing named entity recognition.

Submitted 10 October, 2023; originally announced October 2023.

arXiv:2308.11084 [pdf, other]

doi 10.1145/3581783.3613800

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

Authors: Yimin Deng, Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Abstract: Voice conversion, as the style transfer task applied to speech, refers to converting one person's speech into new speech that sounds like another person's. Up to now, there has been a lot of research devoted to better implementation of VC tasks. However, a good voice conversion model should match not only the timbre information of the target speaker, but also expressive information such as prosody, pace, pause, etc. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is important but challenging, especially without text transcriptions. In this paper, we first propose a novel voice conversion framework named 'PMVC', which effectively separates and models the content, timbre, and prosodic information from the speech without text transcriptions. Specifically, we introduce a new speech augmentation algorithm for robust prosody extraction. Building upon this, a mask-and-predict mechanism is applied in the disentanglement of prosody and content information. The experimental results on the AIShell-3 corpus support our improvement of naturalness and similarity of converted speech.

Submitted 21 August, 2023; originally announced August 2023.

Comments: Accepted by the 31st ACM International Conference on Multimedia (MM2023)

arXiv:2306.14228 [pdf, ps, other]

Task-Oriented Semantics-Aware Communication for Wireless UAV Control and Command Transmission

Authors: Yujie Xu, Zhou Hui, Yansha Deng

Abstract: To guarantee the safety and smooth control of Unmanned Aerial Vehicle (UAV) operation, the new control and command (C&C) data type imposes stringent quality of service (QoS) requirements on the cellular network. However, the existing bit-oriented communication framework is already approaching the Shannon capacity limit, which can hardly guarantee the ultra-reliable low latency communications (URLLC) service for C&C transmission. To solve the problem, task-oriented semantics-aware (TOSA) communication has been proposed recently by jointly exploiting the context of data and its importance to the UAV control task. However, to the best of our knowledge, an explicit and systematic TOSA communication framework for emerging C&C data type remains unknown. Therefore, in this paper, we propose a TOSA communication framework for C&C transmission and define its value of information based on both the similarity and age of information (AoI) of C&C signals. We also propose a deep reinforcement learning (DRL) algorithm to maximize the TOSA information. Last but not least, we present the simulation results to validate the effectiveness of our proposed TOSA communication framework.

Submitted 25 June, 2023; originally announced June 2023.

arXiv:2306.12153 [pdf, other]

DIAS: A Dataset and Benchmark for Intracranial Artery Segmentation in DSA sequences

Authors: Wentao Liu, Tong Tian, Lemeng Wang, Wei** Xu, Lei Li, Haoyuan Li, Wenyi Zhao, Siyu Tian, Xipeng Pan, Huihua Yang, Feng Gao, Yiming Deng, Xin Yang, Ruisheng Su

Abstract: The automated segmentation of Intracranial Arteries (IA) in Digital Subtraction Angiography (DSA) plays a crucial role in the quantification of vascular morphology, significantly contributing to computer-assisted stroke research and clinical practice. Current research primarily focuses on the segmentation of single-frame DSA using proprietary datasets. However, these methods face challenges due to the inherent limitation of single-frame DSA, which only partially displays vascular contrast, thereby hindering accurate vascular structure representation. In this work, we introduce DIAS, a dataset specifically developed for IA segmentation in DSA sequences. We establish a comprehensive benchmark for evaluating DIAS, covering full, weak, and semi-supervised segmentation methods. Specifically, we propose the vessel sequence segmentation network, in which the sequence feature extraction module effectively captures spatiotemporal representations of intravascular contrast, achieving intracranial artery segmentation in 2D+Time DSA sequences. For weakly-supervised IA segmentation, we propose a novel scribble learning-based image segmentation framework, which, under the guidance of scribble labels, employs cross pseudo-supervision and consistency regularization to improve the performance of the segmentation network. Furthermore, we introduce the random patch-based self-training framework, aimed at alleviating the performance constraints encountered in IA segmentation due to the limited availability of annotated DSA data. Our extensive experiments on the DIAS dataset demonstrate the effectiveness of these methods as potential baselines for future research and clinical applications. The dataset and code are publicly available at https://doi.org/10.5281/zenodo.11396520 and https://github.com/lseventeen/DIAS.

Submitted 13 June, 2024; v1 submitted 21 June, 2023; originally announced June 2023.

arXiv:2306.04980 [pdf, other]

Assessing Phrase Break of ESL Speech with Pre-trained Language Models and Large Language Models

Authors: Zhiyi Wang, Shaoguang Mao, Wenshan Wu, Yan Xia, Yan Deng, Jonathan Tien

Abstract: This work introduces approaches to assessing phrase breaks in ESL learners' speech using pre-trained language models (PLMs) and large language models (LLMs). There are two tasks: overall assessment of phrase break for a speech clip and fine-grained assessment of every possible phrase break position. To leverage NLP models, speech input is first force-aligned with texts, and then pre-processed into a token sequence, including words and phrase break information. To utilize PLMs, we propose a pre-training and fine-tuning pipeline with the processed tokens. This process includes pre-training with a replaced break token detection module and fine-tuning with text classification and sequence labeling. To employ LLMs, we design prompts for ChatGPT. The experiments show that with the PLMs, the dependence on labeled training data has been greatly reduced, and the performance has improved. Meanwhile, we verify that ChatGPT, a renowned LLM, has potential for further advancement in this area.

Submitted 8 June, 2023; originally announced June 2023.

Comments: Accepted by InterSpeech 2023. arXiv admin note: substantial text overlap with arXiv:2210.16029

arXiv:2305.08000 [pdf, other]

DNN-Compressed Domain Visual Recognition with Feature Adaptation

Authors: Yingpeng Deng, Lina J. Karam

Abstract: Learning-based image compression was shown to achieve a competitive performance with state-of-the-art transform-based codecs. This motivated the development of new learning-based visual compression standards such as JPEG-AI. Of particular interest to these emerging standards is the development of learning-based image compression systems targeting both humans and machines. This paper is concerned with learning-based compression schemes whose compressed-domain representations can be utilized to perform visual processing and computer vision tasks directly in the compressed domain. In our work, we adopt a learning-based compressed-domain classification framework for performing visual recognition using the compressed-domain latent representation at varying bit-rates. We propose a novel feature adaptation module integrating a lightweight attention model to adaptively emphasize and enhance the key features within the extracted channel-wise information. Also, we design an adaptation training strategy to utilize the pretrained pixel-domain weights. For comparison, in addition to the performance results that are obtained using our proposed latent-based compressed-domain method, we also present performance results using compressed but fully decoded images in the pixel domain as well as original uncompressed images. The obtained performance results show that our proposed compressed-domain classification model can distinctly outperform the existing compressed-domain classification models, and that it can also yield similar accuracy results with a much higher computational efficiency as compared to the pixel-domain models that are trained using fully decoded images.

Submitted 26 July, 2023; v1 submitted 13 May, 2023; originally announced May 2023.

arXiv:2305.02269 [pdf, other]

M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis

Authors: **long Xue, Yayue Deng, Feng** Wang, Ya Li, Yingming Gao, Jianhua Tao, Jianqing Sun, Jiaen Liang

Abstract: Conversational text-to-speech (TTS) aims to synthesize speech with proper prosody of reply based on the historical conversation. However, it is still a challenge to comprehensively model the conversation, and a majority of conversational TTS systems only focus on extracting global information and omit local prosody features, which contain important fine-grained information like keywords and emphasis. Moreover, it is insufficient to only consider the textual features, and acoustic features also contain various prosody information. Hence, we propose M2-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system, aiming to comprehensively utilize historical conversation and enhance prosodic expression. More specifically, we design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling. Experimental results demonstrate that our model mixed with fine-grained context information and additionally considering acoustic features achieves better prosody performance and naturalness in CMOS tests.

Submitted 3 May, 2023; originally announced May 2023.

Comments: 5 pages, 1 figure, 2 tables. Accepted by ICASSP 2023

arXiv:2303.17949 [pdf, other]

Unsupervised Anomaly Detection and Localization of Machine Audio: A GAN-based Approach

Authors: Anbai Jiang, Wei-Qiang Zhang, Yufeng Deng, **yi Fan, Jia Liu

Abstract: Automatic detection of machine anomaly remains challenging for machine learning. We believe the capability of generative adversarial network (GAN) suits the need of machine audio anomaly detection, yet rarely has this been investigated by previous work. In this paper, we propose AEGAN-AD, a totally unsupervised approach in which the generator (also an autoencoder) is trained to reconstruct input spectrograms. It is pointed out that the denoising nature of reconstruction deprecates its capacity. Thus, the discriminator is redesigned to aid the generator during both training stage and detection stage. The performance of AEGAN-AD on the dataset of DCASE 2022 Challenge TASK 2 demonstrates the state-of-the-art result on five machine types. A novel anomaly localization method is also investigated. Source code available at: www.github.com/jianganbai/AEGAN-AD

Submitted 31 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.10398 [pdf, other]

Energy-Efficient Cellular-Connected UAV Swarm Control Optimization

Authors: Yang Su, Hui Zhou, Yansha Deng, Mischa Dohler

Abstract: Cellular-connected unmanned aerial vehicle (UAV) swarm is a promising solution for diverse applications, including cargo delivery and traffic control. However, it is still challenging to communicate with and control the UAV swarm with high reliability, low latency, and high energy efficiency. In this paper, we propose a two-phase command and control (C&C) transmission scheme in a cellular-connected UAV swarm network, where the ground base station (GBS) broadcasts the common C&C message in Phase I. In Phase II, the UAVs that have successfully decoded the C&C message will relay the message to the rest of UAVs via device-to-device (D2D) communications in either broadcast or unicast mode, under latency and energy constraints. To maximize the number of UAVs that receive the message successfully within the latency and energy constraints, we formulate the problem as a Constrained Markov Decision Process to find the optimal policy. To address this problem, we propose a decentralized constrained graph attention multi-agent Deep-Q-network (DCGA-MADQN) algorithm based on Lagrangian primal-dual policy optimization, where a PID-controller algorithm is utilized to update the Lagrange Multiplier. Simulation results show that our algorithm could maximize the number of UAVs that successfully receive the common C&C under energy constraints.

Submitted 18 March, 2023; originally announced March 2023.
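
Illustrative note on the entry above: the abstract mentions a PID-controller update of the Lagrange multiplier within Lagrangian primal-dual policy optimization. The sketch below shows a generic PID-style multiplier update for a constrained MDP. It is not taken from the paper; the class name, the gains `kp`, `ki`, `kd`, and the scalar `episode_cost` / `cost_limit` interface are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PIDLagrangeMultiplier:
    """Generic PID-style Lagrange multiplier update (hypothetical helper)."""

    kp: float = 0.05          # proportional gain (assumed value)
    ki: float = 0.01          # integral gain (assumed value)
    kd: float = 0.02          # derivative gain (assumed value)
    integral: float = 0.0     # accumulated constraint violation, kept >= 0
    prev_cost: float = 0.0    # cost observed at the previous update

    def update(self, episode_cost: float, cost_limit: float) -> float:
        # Positive error means the constraint (e.g. an energy budget) is violated.
        error = episode_cost - cost_limit
        # Integral term is clipped at zero so slack does not accumulate.
        self.integral = max(0.0, self.integral + error)
        # Derivative term reacts only to increasing cost.
        derivative = max(0.0, episode_cost - self.prev_cost)
        self.prev_cost = episode_cost
        # The multiplier itself must stay non-negative.
        return max(
            0.0,
            self.kp * error + self.ki * self.integral + self.kd * derivative,
        )
```

In a primal-dual loop, the returned multiplier would weight the constraint cost in the policy objective (reward minus multiplier times cost), so the penalty grows while the constraint is violated and shrinks back toward zero once it is satisfied.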

arXiv:2302.09332 [pdf, other]

Incipient Fault Detection in Power Distribution System: A Time-Frequency Embedded Deep Learning Based Approach

Authors: Qiyue Li, Huan Luo, Hong Cheng, Yuxing Deng, Wei Sun, Weitao Li, Zhi Liu

Abstract: Incipient fault detection in power distribution systems is crucial to improve the reliability of the grid. However, the non-stationary nature and the inadequacy of the training dataset due to the self-recovery of the incipient fault signal, make the incipient fault detection in power distribution systems a great challenge. In this paper, we focus on incipient fault detection in power distribution systems and address the above challenges. In particular, we propose an ADaptive Time-Frequency Memory (AD-TFM) cell by embedding wavelet transform into the Long Short-Term Memory (LSTM), to extract features in time and frequency domain from the non-stationary incipient fault signals. We make scale parameters and translation parameters of wavelet transform learnable to adapt to the dynamic input signals. Based on the stacked AD-TFM cells, we design a recurrent neural network with ATtention mechanism, named AD-TFM-AT model, to detect incipient fault with multi-resolution and multi-dimension analysis. In addition, we propose two data augmentation methods, namely phase switching and temporal sliding, to effectively enlarge the training datasets. Experimental results on two open datasets show that our proposed AD-TFM-AT model and data augmentation methods achieve state-of-the-art (SOTA) performance of incipient fault detection in power distribution system. We also disclose one used dataset logged at State Grid Corporation of China to facilitate future research.

Submitted 18 February, 2023; originally announced February 2023.

Comments: 15 pages

arXiv:2211.05295 [pdf, other]

Harmonizing output imbalance for defect segmentation on extremely-imbalanced photovoltaic module cells images

Authors: Jianye Yi, Xiaopin Zhong, Weixiang Liu, Zongze Wu, Yuanlong Deng, Zhengguang Wu

Abstract: The continuous development of the photovoltaic (PV) industry has raised high requirements for the quality of monocrystalline of PV module cells. When learning to segment defect regions in PV module cell images, Tiny Hidden Cracks (THC) lead to extremely-imbalanced samples. The ratio of defect pixels to normal pixels can be as low as 1:2000. This extreme imbalance makes it difficult to segment the THC of PV module cells, which is also a challenge for semantic segmentation. To address the problem of segmenting defects on extremely-imbalanced THC data, the paper makes contributions from three aspects: (1) it proposes an explicit measure for output imbalance; (2) it generalizes a distribution-based loss that can handle different types of output imbalances; and (3) it introduces a compound loss with our adaptive hyperparameter selection algorithm that can keep the consistency of training and inference for harmonizing the output imbalance on extremely-imbalanced input data. The proposed method is evaluated on four widely-used deep learning architectures and four datasets with varying degrees of input imbalance. The experimental results show that the proposed method outperforms existing methods.

Submitted 24 October, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

Comments: 19 pages, 16 figures, 3 appendixes

arXiv:2211.01676 [pdf, other]

Repeatable Random Permutation Set

Authors: Wenran Yang, Yong Deng

Abstract: Random permutation set (RPS), as a recently proposed theory, enables powerful information representation by traversing all possible permutations. However, the repetition of items is not allowed in RPS while it is quite common in real life. To address this issue, we propose repeatable random permutation set ($\rm R^2PS$) which takes the repetition of items into consideration. The right and left junctional sum combination rules are proposed and their properties including consistency, pseudo-Matthew effect and associativity are researched. Based on these properties, a decision support system application is simulated to show the effectiveness of $\rm R^2PS$.

Submitted 4 November, 2022; v1 submitted 3 November, 2022; originally announced November 2022.

arXiv:2210.17016 [pdf, other]

Wespeaker: A Research and Production oriented Speaker Embedding Learning Toolkit

Authors: Hongji Wang, Chengdong Liang, Shuai Wang, Zhengyang Chen, Binbin Zhang, Xu Xiang, Yanlei Deng, Yanmin Qian

Abstract: Speaker modeling is essential for many related tasks, such as speaker recognition and speaker diarization. The dominant modeling approach is fixed-dimensional vector representation, i.e., speaker embedding. This paper introduces a research and production oriented speaker embedding learning toolkit, Wespeaker. Wespeaker contains the implementation of scalable data management, state-of-the-art speaker embedding models, loss functions, and scoring back-ends, with highly competitive results achieved by structured recipes which were adopted in the winning systems in several speaker verification challenges. The application to other downstream tasks such as speaker diarization is also exhibited in the related recipe. Moreover, CPU- and GPU-compatible deployment codes are integrated for production-oriented development. The toolkit is publicly available at https://github.com/wenet-e2e/wespeaker.

Submitted 1 November, 2022; v1 submitted 30 October, 2022; originally announced October 2022.

arXiv:2210.09372 [pdf, other]

Goal-Oriented Semantic Communications for 6G Networks

Authors: Hui Zhou, Yansha Deng, Xiaonan Liu, Nikolaos Pappas, Arumugam Nallanathan

Abstract: Upon the arrival of emerging devices, including Extended Reality (XR) and Unmanned Aerial Vehicles (UAVs), the traditional communication framework is approaching Shannon's physical capacity limit and fails to guarantee the massive amount of transmission within latency requirements. By jointly exploiting the context of data and its importance to the task, an emerging communication paradigm shift to semantic level and effectiveness level is envisioned to be a key revolution in Sixth Generation (6G) networks. However, an explicit and systematic communication framework incorporating both semantic level and effectiveness level has not been proposed yet. In this article, we propose a generic goal-oriented semantic communication framework for various tasks with diverse data types, which incorporates both semantic level information and effectiveness-aware performance metrics. We first analyze the unique characteristics of all data types, and summarise the semantic information, along with corresponding extraction methods. We then propose a detailed goal-oriented semantic communication framework for different time-critical and non-critical tasks. In the goal-oriented semantic communication framework, we present the goal-oriented semantic information, extraction methods, recovery methods, and effectiveness-aware performance metrics. Last but not least, we present a goal-oriented semantic communication framework tailored for Unmanned Aerial Vehicle (UAV) control task to validate the effectiveness of the proposed goal-oriented semantic communication framework.

Submitted 6 April, 2024; v1 submitted 17 October, 2022; originally announced October 2022.

arXiv:2209.09411 [pdf, other]

Shepherding Control for Separating a Single Agent from a Swarm

Authors: Yaosheng Deng, Masaki Ogura, Aiyi Li, Naoki Wakamiya

Abstract: In this paper, we consider the swarm-control problem of spatially separating a specified target agent within the swarm from all the other agents, while maintaining the connectivity among the other agents. We specifically aim to achieve the separation by designing the movement algorithm of an external agent, called a shepherd, which exerts repulsive forces on the agents in the swarm. This problem has potential applications in the context of the manipulation of the swarm of micro- and nano-particles. We first formulate the separation problem, where the swarm agents (called sheep) are modeled by the Boid model. We then analytically study the special case of two-sheep swarms. By leveraging the analysis, we then propose a potential function-based movement algorithm of the shepherd to achieve separation while maintaining the connectivity within the remaining swarm. We demonstrate the effectiveness of the proposed algorithm with numerical simulations.

Submitted 19 September, 2022; originally announced September 2022.

Comments: 6 pages, 6 figures

arXiv:2207.00908 [pdf, other]

Interference Constrained Beam Alignment for Time-Varying Channels via Kernelized Bandits

Authors: Yuntian Deng, Xingyu Zhou, Arnob Ghosh, Abhishek Gupta, Ness B. Shroff

Abstract: To fully utilize the abundant spectrum resources in millimeter wave (mmWave), Beam Alignment (BA) is necessary for large antenna arrays to achieve large array gains. In practical dynamic wireless environments, channel modeling is challenging due to time-varying and multipath effects. In this paper, we formulate the beam alignment problem as a non-stationary online learning problem with the objective to maximize the received signal strength under interference constraint. In particular, we employ the non-stationary kernelized bandit to leverage the correlation among beams and model the complex beamforming and multipath channel functions. Furthermore, to mitigate interference to other user equipment, we leverage the primal-dual method to design a constrained UCB-type kernelized bandit algorithm. Our theoretical analysis indicates that the proposed algorithm can adaptively adjust the beam in time-varying environments, such that both the cumulative regret of the received signal and constraint violations have sublinear bounds with respect to time. This result is of independent interest for applications such as adaptive pricing and news ranking. In addition, the algorithm assumes the channel is a black-box function and does not require any prior knowledge for dynamic channel modeling, and thus is applicable in a variety of scenarios. We further show that if the information about the channel variation is known, the algorithm will have better theoretical guarantees and performance. Finally, we conduct simulations to highlight the effectiveness of the proposed algorithm.

Submitted 2 July, 2022; originally announced July 2022.

arXiv:2206.14150 [pdf, other]

Autonomous Smart Grid Fault Detection

Authors: Qiyue Li, Yuxing Deng, Xin Liu, Wei Sun, Weitao Li, Jie Li, Zhi Liu

Abstract: Smart grid plays a crucial role for the smart society and the upcoming carbon neutral society. Achieving autonomous smart grid fault detection is critical for smart grid system state awareness, maintenance and operation. This paper focuses on fault monitoring in smart grid and discusses the inherent technical challenges and solutions. In particular, we first present the basic principles of smart grid fault detection. Then, we explain the new requirements for autonomous smart grid fault detection, the technical challenges and their possible solutions. A case study is introduced, as a preliminary study for autonomous smart grid fault detection. In addition, we highlight relevant directions for future research.

Submitted 27 May, 2022; originally announced June 2022.

arXiv:2204.12426 [pdf, ps, other]

Time-triggered Federated Learning over Wireless Networks

Authors: Xiaokang Zhou, Yansha Deng, Huiyun Xia, Shaochuan Wu, Mehdi Bennis

Abstract: The newly emerging federated learning (FL) framework offers a new way to train machine learning models in a privacy-preserving manner. However, traditional FL algorithms are based on an event-triggered aggregation, which suffers from stragglers and communication overhead issues. To address these issues, in this paper, we present a time-triggered FL algorithm (TT-Fed) over wireless networks, which is a generalized form of classic synchronous and asynchronous FL. Taking the constrained resource and unreliable nature of wireless communication into account, we jointly study the user selection and bandwidth optimization problem to minimize the FL training loss. To solve this joint optimization problem, we provide a thorough convergence analysis for TT-Fed. Based on the obtained analytical convergence upper bound, the optimization problem is decomposed into tractable sub-problems with respect to each global aggregation round, and finally solved by our proposed online search algorithm. Simulation results show that compared to asynchronous FL (FedAsync) and FL with asynchronous user tiers (FedAT) benchmarks, our proposed TT-Fed algorithm improves the converged test accuracy by up to 12.5% and 5%, respectively, under highly imbalanced and non-IID data, while substantially reducing the communication overhead.

Submitted 2 May, 2022; v1 submitted 26 April, 2022; originally announced April 2022.
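
Illustrative note on the entry above: the abstract contrasts event-triggered aggregation with aggregation triggered at fixed times. The sketch below shows only that basic idea, namely averaging whatever client updates arrived inside one time slot, weighted by sample count. It is not the TT-Fed algorithm, which additionally covers user selection and bandwidth optimization; the function name `aggregate_time_slot` and the `(arrival_time, weights, num_samples)` tuple layout are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# (arrival_time_s, model_weights, num_samples): hypothetical update record
Update = Tuple[float, List[float], int]


def aggregate_time_slot(
    updates: Sequence[Update], slot_start: float, slot_end: float
) -> Optional[List[float]]:
    """Sample-weighted average of updates that arrived in (slot_start, slot_end]."""
    # Keep only the updates that landed inside this aggregation slot.
    arrived = [u for u in updates if slot_start < u[0] <= slot_end]
    if not arrived:
        return None  # nothing arrived in time; the global model is left unchanged

    total_samples = sum(n for _, _, n in arrived)
    dim = len(arrived[0][1])
    sums = [0.0] * dim
    for _, weights, n in arrived:
        for i in range(dim):
            sums[i] += weights[i] * n
    return [s / total_samples for s in sums]
```

Because the server aggregates when the slot ends rather than when a fixed set of clients has reported, a straggler simply misses the current slot and can contribute to a later one instead of blocking the round.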

arXiv:2204.10461 [pdf, other]

WaBERT: A Low-resource End-to-end Model for Spoken Language Understanding and Speech-to-BERT Alignment

Authors: Lin Yao, Jianfei Song, Ruizhuo Xu, Yingfang Yang, Zijian Chen, Yafeng Deng

Abstract: Historically lower-level tasks such as automatic speech recognition (ASR) and speaker identification are the main focus in the speech field. Interest has been growing in higher-level spoken language understanding (SLU) tasks recently, like sentiment analysis (SA). However, improving performances on SLU tasks remains a big challenge. Basically, there are two main methods for SLU tasks: (1) Two-stage method, which uses a speech model to transfer speech to text, then uses a language model to get the results of downstream tasks; (2) One-stage method, which just fine-tunes a pre-trained speech model to fit in the downstream tasks. The first method loses emotional cues such as intonation, and causes recognition errors during ASR process, and the second one lacks necessary language knowledge. In this paper, we propose the Wave BERT (WaBERT), a novel end-to-end model combining the speech model and the language model for SLU tasks. WaBERT is based on the pre-trained speech and language model, hence training from scratch is not needed. We also set most parameters of WaBERT frozen during training. By introducing WaBERT, audio-specific information and language knowledge are integrated in the short-time and low-resource training process to improve results on the dev dataset of SLUE SA tasks by 1.15% of recall score and 0.82% of F1 score. Additionally, we modify the serial Continuous Integrate-and-Fire (CIF) mechanism to achieve the monotonic alignment between the speech and text modalities.

Submitted 21 April, 2022; originally announced April 2022.

arXiv:2204.08169 [pdf, ps, other]

Actions at the Edge: Jointly Optimizing the Resources in Multi-access Edge Computing

Authors: Yiqin Deng, Xianhao Chen, Guangyu Zhu, Yuguang Fang, Zhigang Chen, Xiaoheng Deng

Abstract: Multi-access edge computing (MEC) is an emerging paradigm that pushes resources for sensing, communications, computing, storage and intelligence (SCCSI) to the premises closer to the end users, i.e., the edge, so that they could leverage the nearby rich resources to improve their quality of experience (QoE). Due to the growing emerging applications targeting at intelligentizing life-sustaining cyber-physical systems, this paradigm has become a hot research topic, particularly when MEC is utilized to provide edge intelligence and real-time processing and control. This article is to elaborate the research issues along this line, including basic concepts and performance metrics, killer applications, architectural design, modeling approaches and solutions, and future research directions. It is hoped that this article provides a quick introduction to this fruitful research area particularly for beginning researchers.

Submitted 18 April, 2022; originally announced April 2022.

Comments: 7 pages, 2 figures, accepted by IEEE Wireless Communications

arXiv:2203.10473 [pdf, other]

ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

Authors: Jinlong Xue, Yayue Deng, Yichen Han, Ya Li, Jianqing Sun, Jiaen Liang

Abstract: In recent years, neural network based methods for multi-speaker text-to-speech synthesis (TTS) have made significant progress. However, the current speaker encoder models used in these methods still cannot capture enough speaker information. In this paper, we focus on accurate speaker encoder modeling and propose an end-to-end method that can generate high-quality speech and better similarity for both seen and unseen speakers. The proposed architecture consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN model which is derived from speaker verification task, a FastSpeech2 based synthesizer, and a HiFi-GAN vocoder. The comparison among different speaker encoder models shows our proposed method can achieve better naturalness and similarity. To efficiently evaluate our synthesized speech, we are the first to adopt deep learning based automatic MOS evaluation methods to assess our results, and these methods show great potential in automatic speech quality assessment.

Submitted 26 March, 2022; v1 submitted 20 March, 2022; originally announced March 2022.

Comments: 5 pages, 2 figures, submitted to interspeech2022

arXiv:2203.03004 [pdf, other]

Low-Complexity Beamforming Design for IRS-Aided NOMA Communication System with Imperfect CSI

Authors: Yasaman Omid, S. M. Mahdi Shahabi, Cunhua Pan, Yansha Deng, Arumugam Nallanathan

Abstract: Intelligent reflecting surface (IRS) as a promising technology rendering high throughput in future communication systems is compatible with various communication techniques such as non-orthogonal multiple-access (NOMA). In this paper, the downlink transmission of IRS-assisted NOMA communication is considered while undergoing imperfect channel state information (CSI). Consequently, a robust IRS-aided NOMA design is proposed by solving the sum-rate maximization problem to jointly find the optimal beamforming vectors for the access point and the passive reflection matrix for the IRS, using the penalty dual decomposition (PDD) scheme. This problem can be solved through an iterative algorithm, with closed-form solutions in each step, and it is shown to have very close performance to its upper bound obtained from perfect CSI scenario. We also present a trellis-based method for optimal discrete phase shift selection of IRS which is shown to outperform the conventional quantization method. Our results show that the proposed algorithms, for both continuous and discrete IRS, have very low computational complexity compared to other schemes in the literature. Furthermore, we conduct a performance comparison from achievable sum-rate standpoint between IRS-aided NOMA and IRS-aided orthogonal multiple access (OMA), which demonstrates superiority of NOMA compared to OMA in case of a tolerated channel uncertainty.

Submitted 16 March, 2022; v1 submitted 6 March, 2022; originally announced March 2022.

arXiv:2203.02098 [pdf, other]

Universal Segmentation of 33 Anatomies

Authors: Pengbo Liu, Yang Deng, Ce Wang, Yuan Hui, Qian Li, Jun Li, Shiwei Luo, Mengke Sun, Quan Quan, Shuxin Yang, You Hao, Honghu Xiao, Chunpeng Zhao, Xinbao Wu, S. Kevin Zhou

Abstract: In the paper, we present an approach for learning a single model that universally segments 33 anatomical structures, including vertebrae, pelvic bones, and abdominal organs. Our model building has to address the following challenges. Firstly, while it is ideal to learn such a model from a large-scale, fully-annotated dataset, it is practically hard to curate such a dataset. Thus, we resort to learning from a union of multiple datasets, with each dataset containing images that are partially labeled. Secondly, along the line of partial labelling, we contribute an open-source, large-scale vertebra segmentation dataset for the benefit of the spine analysis community, CTSpine1K, boasting over 1,000 3D volumes and over 11K annotated vertebrae. Thirdly, in a 3D medical image segmentation task, due to the limitation of GPU memory, we always train a model using cropped patches as inputs instead of a whole 3D volume, which limits the amount of contextual information to be learned. To this end, we propose a cross-patch transformer module to fuse more information in adjacent patches, which enlarges the aggregated receptive field for improved segmentation performance. This is especially important for segmenting, say, the elongated spine. Based on 7 partially labeled datasets that collectively contain about 2,800 3D volumes, we successfully learn such a universal model. Finally, we evaluate the universal model on multiple open-source datasets, proving that our model has a good generalization performance and can potentially serve as a solid foundation for downstream tasks.

Submitted 3 March, 2022; originally announced March 2022.

arXiv:2112.09312 [pdf, other]

MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling

Authors: Yusong Wu, Ethan Manilow, Yi Deng, Rigel Swavely, Kyle Kastner, Tim Cooijmans, Aaron Courville, Cheng-Zhi Anna Huang, Jesse Engel

Abstract: Musical expression requires control of both what notes are played, and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control. In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control. Starting from interpretable Differentiable Digital Signal Processing (DDSP) synthesis parameters, we infer musical notes and high-level properties of their expressive performance (such as timbre, vibrato, dynamics, and articulation). This creates a 3-level hierarchy (notes, performance, synthesis) that affords individuals the option to intervene at each level, or utilize trained priors (performance given notes, synthesis given performance) for creative assistance. Through quantitative experiments and listening tests, we demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and as a complete system, generate realistic audio from a novel note sequence. By utilizing an interpretable hierarchy, with multiple levels of granularity, MIDI-DDSP opens the door to assistive tools to empower individuals across a diverse range of musical experience.

Submitted 17 March, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: Accepted by International Conference on Learning Representations (ICLR) 2022

arXiv:2111.09284 [pdf, other]

Optimization of Grant-Free NOMA with Multiple Configured-Grants for mURLLC

Authors: Yan Liu, Yansha Deng, Maged Elkashlan, Arumugam Nallanathan, George K. Karagiannidis

Abstract: Massive Ultra-Reliable and Low-Latency Communications (mURLLC), which integrates URLLC with massive access, is emerging as a new and important service class in the next generation (6G) for time-sensitive traffics and has recently received tremendous research attention. However, realizing efficient, delay-bounded, and reliable communications for a massive number of user equipments (UEs) in mURLLC, is extremely challenging as it needs to simultaneously take into account the latency, reliability, and massive access requirements. To support these requirements, the third generation partnership project (3GPP) has introduced enhanced grant-free (GF) transmission in the uplink (UL), with multiple active configured-grants (CGs) for URLLC UEs. With multiple CGs (MCG) for UL, UE can choose any of these grants as soon as the data arrives. In addition, non-orthogonal multiple access (NOMA) has been proposed to synergize with GF transmission to mitigate the serious transmission delay and network congestion problems. In this paper, we develop a novel learning framework for MCG-GF-NOMA systems with bursty traffic. We first design the MCG-GF-NOMA model by characterizing each CG using the parameters: the number of contention-transmission units (CTUs), the starting slot of each CG within a subframe, and the number of repetitions of each CG. Based on the model, the latency and reliability performances are characterized. We then formulate the MCG-GF-NOMA resources configuration problem taking into account three constraints. Finally, we propose a Cooperative Multi-Agent based Double Deep Q-Network (CMA-DDQN) algorithm to allocate the channel resources among MCGs so as to maximize the number of successful transmissions under the latency constraint. Our results show that the MCG-GF-NOMA framework can simultaneously improve the low latency and high reliability performances in massive URLLC.

Submitted 17 November, 2021; originally announced November 2021.

Comments: 15 pages, 15 figures, submitted to IEEE JSAC SI on Next Generation Multiple Access. arXiv admin note: text overlap with arXiv:2101.00515

arXiv:2111.00418 [pdf, other]

Deep Learning in Human Activity Recognition with Wearable Sensors: A Review on Advances

Authors: Shibo Zhang, Yaxuan Li, Shen Zhang, Farzad Shahabi, Stephen Xia, Yu Deng, Nabil Alshurafa

Abstract: Mobile and wearable devices have enabled numerous applications, including activity tracking, wellness monitoring, and human-computer interaction, that measure and improve our daily lives. Many of these applications are made possible by leveraging the rich collection of low-power sensors found in many mobile and wearable devices to perform human activity recognition (HAR). Recently, deep learning has greatly pushed the boundaries of HAR on mobile and wearable devices. This paper systematically categorizes and summarizes existing work that introduces deep learning methods for wearables-based HAR and provides a comprehensive analysis of the current advancements, developing trends, and major challenges. We also present cutting-edge frontiers and future directions for deep learning-based HAR.

Submitted 3 March, 2022; v1 submitted 31 October, 2021; originally announced November 2021.

arXiv:2108.00506 [pdf, other]

Scalable Multi-agent Reinforcement Learning Algorithm for Wireless Networks

Authors: Fenghe Hu, Yansha Deng, A. Hamid Aghvami

Abstract: Scalability is the key roadstone towards the application of cooperative intelligent algorithms in large-scale networks. Reinforcement learning (RL) is known as a model-free and highly efficient intelligent algorithm for communication problems and has proved useful in communication networks. However, in large-scale networks with limited centralization, it is not possible to employ a centralized entity to perform joint real-time decision making for the entire network. This introduces scalability challenges, while multi-agent reinforcement learning offers the opportunity to cope with these challenges and extend the intelligent algorithm to cooperative large-scale networks. In this paper, we introduce the federated mean-field multi-agent reinforcement learning structure to capture the problem in large-scale multi-agent communication scenarios, where agents share parameters to form consistency. We present the theoretical basis of our architecture and show the influence of federated frequency with an informational multi-agent model. We then examine the performance of our architecture with a coordinated multi-point environment which requires handshakes between neighbour access-points to realise the cooperation gain. Our result shows that the learning structure can effectively solve the cooperation problem in a large-scale network with decent scalability. We also show the effectiveness of federated algorithms and highlight the importance of maintaining personality in each access-point.

Submitted 4 November, 2021; v1 submitted 1 August, 2021; originally announced August 2021.

Comments: 18 pages, 9 figures

arXiv:2107.12943 [pdf, other]

Learning-based Prediction, Rendering and Transmission for Interactive Virtual Reality in RIS-Assisted Terahertz Networks

Authors: Xiaonan Liu, Yansha Deng, Chong Han, Marco Di Renzo

Abstract: The quality of experience (QoE) requirements of wireless Virtual Reality (VR) can only be satisfied with high data rate, high reliability, and low VR interaction latency. This high data rate over short transmission distances may be achieved via abundant bandwidth in the terahertz (THz) band. However, THz waves suffer from severe signal attenuation, which may be compensated by the reconfigurable intelligent surface (RIS) technology with programmable reflecting elements. Meanwhile, the low VR interaction latency may be achieved with the mobile edge computing (MEC) network architecture due to its high computation capability. Motivated by these considerations, in this paper, we propose a MEC-enabled and RIS-assisted THz VR network in an indoor scenario, by taking into account the uplink viewpoint prediction and position transmission, MEC rendering, and downlink transmission. We propose two methods, which are referred to as centralized online Gated Recurrent Unit (GRU) and distributed Federated Averaging (FedAvg), to predict the viewpoints of VR users. In the uplink, an algorithm that integrates online Long-short Term Memory (LSTM) and Convolutional Neural Networks (CNN) is deployed to predict the locations and the line-of-sight and non-line-of-sight statuses of the VR users over time. In the downlink, we further develop a constrained deep reinforcement learning algorithm to select the optimal phase shifts of the RIS under latency constraints. Simulation results show that our proposed learning architecture achieves near-optimal QoE as that of the genie-aided benchmark algorithm, and about two times improvement in QoE compared to the random phase shift selection scheme.

Submitted 27 July, 2021; originally announced July 2021.

arXiv:2106.04312 [pdf, other]

Speech BERT Embedding For Improving Prosody in Neural TTS

Authors: Liping Chen, Yan Deng, Xi Wang, Frank K. Soong, Lei He

Abstract: This paper presents a speech BERT model to extract embedded prosody information in speech segments for improving the prosody of synthesized speech in neural text-to-speech (TTS). As a pre-trained model, it can learn prosody attributes from a large amount of speech data, which can utilize more data than the original training data used by the target TTS. The embedding is extracted from the previous segment of a fixed length in the proposed BERT. The extracted embedding is then used together with the mel-spectrogram to predict the following segment in the TTS decoder. Experimental results obtained by the Transformer TTS show that the proposed BERT can extract fine-grained, segment-level prosody, which is complementary to utterance-level prosody to improve the final prosody of the TTS speech. The objective distortions measured on a single speaker TTS are reduced between the generated speech and original recordings. Subjective listening tests also show that the proposed approach is favorably preferred over the TTS without the BERT prosody embedding module, for both in-domain and out-of-domain applications. For Microsoft professional, single/multiple speakers and the LJ Speaker in the public database, subjective preference is similarly confirmed with the new BERT prosody embedding. TTS demo audio samples are in https://judy44chen.github.io/TTSSpeechBERT/.

Submitted 14 September, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

Journal ref: ICASSP 2021

arXiv:2106.02800 [pdf, other]

AOSLO-net: A deep learning-based method for automatic segmentation of retinal microaneurysms from adaptive optics scanning laser ophthalmoscope images

Authors: Qian Zhang, Konstantina Sampani, Mengjia Xu, Shengze Cai, Yixiang Deng, He Li, Jennifer K. Sun, George Em Karniadakis

Abstract: Microaneurysms (MAs) are one of the earliest signs of diabetic retinopathy (DR), a frequent complication of diabetes that can lead to visual impairment and blindness. Adaptive optics scanning laser ophthalmoscopy (AOSLO) provides real-time retinal images with resolution down to 2 $\mu$m and thus allows detection of the morphologies of individual MAs, a potential marker that might dictate MA pathology and affect the progression of DR. In contrast to the numerous automatic models developed for assessing the number of MAs on fundus photographs, currently there is no high throughput image protocol available for automatic analysis of AOSLO photographs. To address this urgency, we introduce AOSLO-net, a deep neural network framework with customized training policies to automatically segment MAs from AOSLO images. We evaluate the performance of AOSLO-net using 87 DR AOSLO images and our results demonstrate that the proposed model outperforms the state-of-the-art segmentation model both in accuracy and cost and enables correct MA morphological classification.

Submitted 25 June, 2021; v1 submitted 5 June, 2021; originally announced June 2021.

arXiv:2105.14711 [pdf, other]

CTSpine1K: A Large-Scale Dataset for Spinal Vertebrae Segmentation in Computed Tomography

Authors: Yang Deng, Ce Wang, Yuan Hui, Qian Li, Jun Li, Shiwei Luo, Mengke Sun, Quan Quan, Shuxin Yang, You Hao, Pengbo Liu, Honghu Xiao, Chunpeng Zhao, Xinbao Wu, S. Kevin Zhou

Abstract: Spine-related diseases have high morbidity and cause a huge burden of social cost. Spine imaging is an essential tool for noninvasively visualizing and assessing spinal pathology. Segmenting vertebrae in computed tomography (CT) images is the basis of quantitative medical image analysis for clinical diagnosis and surgery planning of spine diseases. Current publicly available annotated datasets on spinal vertebrae are small in size. Due to the lack of a large-scale annotated spine image dataset, the mainstream deep learning-based segmentation methods, which are data-driven, are heavily restricted. In this paper, we introduce a large-scale spine CT dataset, called CTSpine1K, curated from multiple sources for vertebra segmentation, which contains 1,005 CT volumes with over 11,100 labeled vertebrae belonging to different spinal conditions. Based on this dataset, we conduct several spinal vertebrae segmentation experiments to set the first benchmark. We believe that this large-scale dataset will facilitate further research in many spine-related image analysis tasks, including but not limited to vertebrae segmentation, labeling, 3D spine reconstruction from biplanar radiographs, image super-resolution, and enhancement.

Submitted 5 July, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

arXiv:2105.14576 [pdf, other]

StyTr$^2$: Image Style Transfer with Transformers

Authors: Yingying Deng, Fan Tang, Weiming Dong, Chongyang Ma, Xingjia Pan, Lei Wang, Changsheng Xu

Abstract: The goal of image style transfer is to render an image with artistic features guided by a style reference while maintaining the original content. Owing to the locality in convolutional neural networks (CNNs), extracting and maintaining the global information of input images is difficult. Therefore, traditional neural style transfer methods face biased content representation. To address this critical issue, we take long-range dependencies of input images into account for image style transfer by proposing a transformer-based approach called StyTr$^2$. In contrast with visual transformers for other vision tasks, StyTr$^2$ contains two different transformer encoders to generate domain-specific sequences for content and style, respectively. Following the encoders, a multi-layer transformer decoder is adopted to stylize the content sequence according to the style sequence. We also analyze the deficiency of existing positional encoding methods and propose the content-aware positional encoding (CAPE), which is scale-invariant and more suitable for image style transfer tasks. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed StyTr$^2$ compared with state-of-the-art CNN-based and flow-based approaches. Code and models are available at https://github.com/diyiiyiii/StyTR-2.

Submitted 1 April, 2022; v1 submitted 30 May, 2021; originally announced May 2021.

Comments: Accepted by CVPR 2022

arXiv:2104.03815 [pdf, other]

Exploring Machine Speech Chain for Domain Adaptation and Few-Shot Speaker Adaptation

Authors: Fengpeng Yue, Yan Deng, Lei He, Tom Ko

Abstract: Machine Speech Chain, which integrates both end-to-end (E2E) automatic speech recognition (ASR) and text-to-speech (TTS) into one circle for joint training, has been proven to be effective in data augmentation by leveraging large amounts of unpaired data. In this paper, we explore the TTS->ASR pipeline in speech chain to do domain adaptation for both neural TTS and E2E ASR models, with only text data from the target domain. We conduct experiments by adapting from the audiobook domain (LibriSpeech) to the presentation domain (TED-LIUM); there is a relative word error rate (WER) reduction of 10% for the E2E ASR model on the TED-LIUM test set, and a relative WER reduction of 51.5% in synthetic speech generated by neural TTS in the presentation domain. Further, we apply few-shot speaker adaptation for the E2E ASR by using a few utterances from target speakers in an unsupervised way, which results in additional gains.

Submitted 8 April, 2021; originally announced April 2021.

arXiv:2103.10241 [pdf, ps, other]

Analyzing Uplink Grant-free Sparse Code Multiple Access System in Massive IoT Networks

Authors: Ke Lai, Jing Lei, Yansha Deng, Lei Wen, Gaojie Chen

Abstract: Grant-free sparse code multiple access (GF-SCMA) is considered to be a promising multiple access candidate for future wireless networks. In this paper, we focus on characterizing the performance of uplink GF-SCMA schemes in a network with ubiquitous connections, such as the Internet of Things (IoT) networks. To provide a tractable approach to evaluate the performance of GF-SCMA, we first develop a theoretical model taking into account the property of multi-user detection (MUD) in the SCMA system. We then analyze the error rate performance of GF-SCMA in the case of codebook collision to investigate the reliability of GF-SCMA when reusing codebook in massive IoT networks. For performance evaluation, accurate approximations for both success probability and average symbol error probability (ASEP) are derived. To elaborate further, we utilize the analytical results to discuss the impact of codeword sparse degree in GF-SCMA. After that, we conduct a comparative study between SCMA and its variant, dense code multiple access (DCMA), with GF transmission to offer insights into the effectiveness of these two schemes. This facilitates the GF-SCMA system design in practical implementation. Simulation results show that denser codebooks can help to support more UEs and increase the reliability of data transmission in a GF-SCMA network. Moreover, a higher success probability can be achieved by GF-SCMA with denser UE deployment at low detection thresholds since SCMA can achieve overloading gain.

Submitted 18 March, 2021; originally announced March 2021.

arXiv:2102.10637 [pdf, other]

QoE Optimization for Live Video Streaming in UAV-to-UAV Communications via Deep Reinforcement Learning

Authors: Liyana Adilla binti Burhanuddin, Xiaonan Liu, Yansha Deng, Ursula Challita, Andras Zahemszky

Abstract: A challenge for rescue teams when fighting against wildfire in remote areas is the lack of information, such as the size and images of fire areas. As such, live streaming from Unmanned Aerial Vehicles (UAVs), capturing videos of dynamic fire areas, is crucial for firefighter commanders in any location to monitor the fire situation with quick response. The 5G network is a promising wireless technology to support such scenarios. In this paper, we consider a UAV-to-UAV (U2U) communication scenario, where a UAV at a high altitude acts as a mobile base station (UAV-BS) to stream videos from other flying UAV-users (UAV-UEs) through the uplink. Due to the mobility of the UAV-BS and UAV-UEs, it is important to determine the optimal movements and transmission powers for UAV-BSs and UAV-UEs in real-time, so as to maximize the data rate of video transmission with smoothness and low latency, while mitigating the interference according to the dynamics in fire areas and wireless channel conditions. In this paper, we co-design the video resolution, the movement, and the power control of UAV-BS and UAV-UEs to maximize the Quality of Experience (QoE) of real-time video streaming. We use the Deep Q-Network (DQN) and Actor-Critic (AC) methods to maximize the QoE of video transmission from all UAV-UEs to a single UAV-BS. Simulation results show the effectiveness of our proposed algorithm in terms of the QoE, delay and video smoothness as compared to the Greedy algorithm.

Submitted 21 February, 2021; originally announced February 2021.

Showing 1–50 of 103 results for author: Deng, Y