Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition

Authors: Yicong Jiang, Tianzi Wang, Xurong Xie, Juan Liu, Wei Sun, Nan Yan, Hui Chen, Lan Wang, Xunying Liu, Feng Tian

Abstract: Disordered speech recognition has profound implications for improving the quality of life of individuals afflicted with, for example, dysarthria. Dysarthric speech recognition encounters challenges including limited data, substantial dissimilarities between dysarthric and non-dysarthric speakers, and significant speaker variations stemming from the disorder. This paper introduces Perceiver-Prompt, a method for speaker adaptation that utilizes P-Tuning on the Whisper large-scale model. We first fine-tune Whisper using LoRA and then integrate a trainable Perceiver to generate fixed-length speaker prompts from variable-length inputs, to improve model recognition of Chinese dysarthric speech. Experimental results from our Chinese dysarthric speech dataset demonstrate consistent improvements in recognition performance with Perceiver-Prompt. A relative reduction of up to 13.04% in CER is obtained over the fine-tuned Whisper.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024
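The abstract above reports its gain as a relative reduction in character error rate (CER). As a rough illustration of how such a figure is usually computed, the sketch below uses the standard definition of CER (character-level edit distance divided by reference length). The function names and the example numbers are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance over reference length (ref must be non-empty)."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, adapted):
    """Relative improvement of `adapted` over `baseline` (both are error rates)."""
    return (baseline - adapted) / baseline


# Illustrative values only: a baseline CER of 0.10 improved to 0.08
# corresponds to a relative reduction of about 20%.
print(cer("abcd", "abxd"))
print(relative_reduction(0.10, 0.08))
```

A relative reduction is always measured against the baseline error rate, so the same absolute gain looks larger when the baseline is already low.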

arXiv:2405.03254 [pdf]

Automatic Assessment of Dysarthria Using Audio-visual Vowel Graph Attention Network

Authors: Xiaokang Liu, Xiaoxia Du, Juan Liu, Rongfeng Su, Manwa Lawrence Ng, Yumei Zhang, Yudong Yang, Shaofeng Zhao, Lan Wang, Nan Yan

Abstract: Automatic assessment of dysarthria remains a highly challenging task due to high variability in acoustic signals and limited data. Currently, research on the automatic assessment of dysarthria primarily focuses on two approaches: one that utilizes expert features combined with machine learning, and the other that employs data-driven deep learning methods to extract representations. Research has demonstrated that expert features are effective in representing pathological characteristics, while deep learning methods excel at uncovering latent features. Therefore, integrating the advantages of expert features and deep learning to construct a neural network architecture based on expert knowledge may be beneficial for interpretability and assessment performance. In this context, the present paper proposes a vowel graph attention network based on audio-visual information, which effectively integrates the strengths of expert knowledge and deep learning. Firstly, various features were combined as inputs, including knowledge-based acoustical features and deep-learning-based pre-trained representations. Secondly, a graph network structure based on vowel space theory was designed, allowing for a deep exploration of spatial correlations among vowels. Finally, visual information was incorporated into the model to further enhance its robustness and generalizability. The method exhibited superior performance in regression experiments targeting Frenchay scores compared to existing approaches.

Submitted 6 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

Comments: 10 pages, 7 figures, 7 tables

arXiv:2403.05820 [pdf, other]

An Audio-textual Diffusion Model For Converting Speech Signals Into Ultrasound Tongue Imaging Data

Authors: Yudong Yang, Rongfeng Su, Xiaokang Liu, Nan Yan, Lan Wang

Abstract: Acoustic-to-articulatory inversion (AAI) converts audio into articulator movements, such as ultrasound tongue imaging (UTI) data. A limitation of existing AAI methods is that they use only personalized acoustic information to derive the general patterns of tongue motion, which limits the quality of the generated UTI data. To address this issue, this paper proposes an audio-textual diffusion model for the UTI data generation task. In this model, the inherent acoustic characteristics of individuals related to the tongue motion details are encoded using wav2vec 2.0, while the ASR transcriptions related to the universality of tongue motions are encoded using BERT. UTI data are then generated by a diffusion module. Experimental results showed that the proposed diffusion model could generate high-quality UTI data with clear tongue contours, which are crucial for linguistic analysis and clinical assessment. The project can be found at https://yangyudong2020.github.io/wav2uti/

Submitted 12 March, 2024; v1 submitted 9 March, 2024; originally announced March 2024.

Comments: Accepted at ICASSP 2024

arXiv:2401.01997 [pdf]

Generating Rhythm Game Music with Jukebox

Authors: Nicholas Yan

Abstract: Music has always been thought of as a "human" endeavor -- when praising a piece of music, we emphasize the composer's creativity and the emotions the music evokes. Because music also relies heavily on patterns and repetition in the form of recurring melodic themes and chord progressions, artificial intelligence has increasingly been able to replicate music in a human-like fashion. This research investigated the capabilities of Jukebox, an open-source commercially available neural network, to accurately replicate two genres of music often found in rhythm games, artcore and orchestral. A Google Colab notebook provided the computational resources necessary to sample and extend a total of sixteen piano arrangements of both genres. A survey containing selected samples was distributed to a local youth orchestra to gauge people's perceptions of the musicality of AI and human-generated music. Even though humans preferred human-generated music, Jukebox's slightly high rating showed that it was somewhat capable of mimicking the styles of both genres. Despite the limitations of Jukebox using only raw audio and of a relatively small sample size, it shows promise for the future of AI as a collaborative tool in music production.

Submitted 28 December, 2023; originally announced January 2024.

arXiv:2210.17181 [pdf, other]

Device Scheduling for Over-the-Air Federated Learning with Differential Privacy

Authors: Na Yan, Kezhi Wang, Cunhua Pan, Kok Keong Chai

Abstract: In this paper, we propose a device scheduling scheme for differentially private over-the-air federated learning (DP-OTA-FL) systems, referred to as S-DPOTAFL, where the privacy of the participants is guaranteed by channel noise. In S-DPOTAFL, the gradients are aligned by the alignment coefficient and aggregated via over-the-air computation (AirComp). The scheme schedules the devices with better channel conditions for training, to avoid the problem that the alignment coefficient is limited by the device with the worst channel condition in the system. We conduct privacy and convergence analyses to theoretically demonstrate the impact of device scheduling on privacy protection and learning performance. To improve the learning accuracy, we formulate an optimization problem with the goal of minimizing the training loss subject to privacy and transmit power constraints. Furthermore, we present the condition under which S-DPOTAFL performs better than DP-OTA-FL without device scheduling (NoS-DPOTAFL). The effectiveness of S-DPOTAFL is validated through simulations.

Submitted 13 November, 2022; v1 submitted 31 October, 2022; originally announced October 2022.

Comments: arXiv admin note: text overlap with arXiv:2210.07669

arXiv:2210.07669 [pdf, other]

Toward Secure and Private Over-the-Air Federated Learning

Authors: Na Yan, Kezhi Wang, Kangda Zhi, Cunhua Pan, Kok Keong Chai, H. Vincent Poor

Abstract: In this paper, a novel secure and private over-the-air federated learning (SP-OTA-FL) framework is studied where noise is employed to protect data privacy and system security. Specifically, the privacy leakage of user data and the security level of the system are measured by differential privacy (DP) and mean square error security (MSE-security), respectively. To mitigate the impact of noise on learning accuracy, we propose a channel-weighted post-processing (CWPP) mechanism, which assigns a smaller weight to the gradient of the device with poor channel conditions. Furthermore, employing CWPP can avoid the issue that the signal-to-noise ratio (SNR) of the overall system is limited by the device with the worst channel condition in aligned over-the-air federated learning (OTA-FL). We theoretically analyze the effect of noise on privacy and security protection and also illustrate the adverse impact of noise on learning performance by conducting convergence analysis. Based on these analytical results, we propose device scheduling policies considering privacy and security protection in different cases of channel noise. In particular, we formulate an integer nonlinear fractional programming problem aiming to minimize the negative impact of noise on the learning process. We obtain the closed-form solution to the optimization problem when the model has high dimension. For the general case, we propose a secure and private algorithm (SPA) based on the branch-and-bound (BnB) method, which can obtain an optimal solution with low complexity. The effectiveness of the proposed CWPP mechanism and the policies for device selection are validated through simulations.

Submitted 14 October, 2022; originally announced October 2022.

arXiv:2110.03392 [pdf, other]

Enhanced Memory Network: The novel network structure for Symbolic Music Generation

Authors: ** Li, Haibin Liu, Nan Yan, Lan Wang

Abstract: Symbolic melody generation is one of the essential tasks for automatic music generation. Recently, models based on neural networks have had a significant influence on generating symbolic melodies. However, the musical context structure is complicated to capture through deep neural networks. Although long short-term memory (LSTM) attempts to solve this problem by learning order dependence in the musical sequence, it is not capable of capturing musical context with only one note as input for each time step of the LSTM. In this paper, we propose a novel Enhanced Memory Network (EMN) with several recurrent units, named Enhanced Memory Units (EMU), to explicitly modify the internal architecture of LSTM to contain music beat information and to reinforce the memory of the latest musical beat by aggregating the beat inside the memory gate. In addition, to increase the diversity of generated musical notes, the cosine distance among adjacent time steps of hidden states is considered as part of the loss function to avoid a high similarity score that harms the diversity of generated notes. Objective and subjective evaluation results show that the proposed method achieves state-of-the-art performance. Code and music demo are available at https://github.com/qrqrqrqr/EMU

Submitted 7 October, 2021; originally announced October 2021.
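The abstract above mentions adding a cosine-based term over adjacent hidden states to the loss so that consecutive states do not become too similar. The sketch below shows one plausible reading of that idea in plain Python. The penalty form, the `weight` parameter, and all function names are assumptions made for illustration and are not taken from the paper or its repository.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adjacent_similarity_penalty(hidden_states):
    """Mean cosine similarity between consecutive hidden states (needs >= 2 states)."""
    pairs = zip(hidden_states, hidden_states[1:])
    sims = [cosine_similarity(h_t, h_next) for h_t, h_next in pairs]
    return sum(sims) / len(sims)


def total_loss(task_loss, hidden_states, weight=0.1):
    """Hypothetical combined loss: task loss plus a weighted similarity penalty."""
    return task_loss + weight * adjacent_similarity_penalty(hidden_states)


# Identical consecutive states give the largest penalty (1.0);
# orthogonal consecutive states give none (0.0).
print(adjacent_similarity_penalty([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]))
print(adjacent_similarity_penalty([[1.0, 0.0], [0.0, 1.0]]))
```

Minimizing this penalty pushes adjacent hidden states apart in direction, which is one way a model could be discouraged from repeating the same note at every step.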

arXiv:2108.08663 [pdf, other]

Unsupervised Cross-Lingual Speech Emotion Recognition Using Pseudo Multilabel

Authors: ** Li, Nan Yan, Lan Wang

Abstract: Speech Emotion Recognition (SER) in a single language has achieved remarkable results through deep learning approaches in the last decade. However, cross-lingual SER remains a challenge in real-world applications due to the large difference between the source and target domain distributions. To address this issue, we propose an unsupervised cross-lingual Neural Network with Pseudo Multilabel (NNPM) that is trained to learn the emotion similarities between source domain features inside an external memory adjusted to identify emotion in cross-lingual databases. NNPM introduces a novel approach that leverages external memory to store source domain features and generates pseudo multilabels for each target domain sample by computing the similarities between the external memory and the target domain features. We evaluate our approach on speech emotion databases in multiple languages. Experimental results show that our proposed approach significantly improves the weighted accuracy (WA) across multiple low-resource languages on the Urdu, Skropus, ShEMO, and EMO-DB corpora. To facilitate further research, code is available at https://github.com/happy**/NNPM

Submitted 7 October, 2021; v1 submitted 19 August, 2021; originally announced August 2021.

arXiv:2108.07980 [pdf, other]

A Multi-level Acoustic Feature Extraction Framework for Transformer Based End-to-End Speech Recognition

Authors: ** Li, Rongfeng Su, Xurong Xie, Nan Yan, Lan Wang

Abstract: Transformer-based end-to-end modelling approaches with multiple stream inputs have achieved great success in various automatic speech recognition (ASR) tasks. An important issue associated with such approaches is that the intermediate features derived from each stream might have similar representations and thus lack feature diversity, such as the descriptions related to speaker characteristics. To address this issue, this paper proposes a novel multi-level acoustic feature extraction framework that can be easily combined with Transformer-based ASR models. The framework consists of two input streams: a shallow stream with high-resolution spectrograms and a deep stream with low-resolution spectrograms. The shallow stream is used to acquire traditional shallow features that are beneficial for the classification of phones or words, while the deep stream is used to obtain utterance-level speaker-invariant deep features for improving the feature diversity. A feature-correlation-based fusion strategy is used to aggregate both features across the frequency and time domains, and the result is then fed into the Transformer encoder-decoder module. By using the proposed multi-level acoustic feature extraction framework, state-of-the-art word error rates of 21.7% and 2.5% were obtained on the HKUST Mandarin telephone and Librispeech speech recognition tasks, respectively.

Submitted 7 July, 2022; v1 submitted 18 August, 2021; originally announced August 2021.

Comments: Accepted by Interspeech 2022

arXiv:2108.07974 [pdf, other]

FDN: Finite Difference Network with Hierarchical Convolutional Features for Text-independent Speaker Verification

Authors: ** Li, Nan Yan, Lan Wang

Abstract: In recent years, using raw waveforms as input for deep networks has been widely explored for speaker verification systems. For example, RawNet and RawNet2 extract speakers' feature embeddings from waveforms automatically for recognizing their voices, which can vastly reduce the front-end computation and obtain state-of-the-art performance. However, these models do not consider the speaker's high-level behavioral features, such as intonation, which indicates each speaker's universal style, rhythm, etc. This paper presents a novel network that can handle the intonation information by computing the finite difference of different speakers' utterance variations. Furthermore, a hierarchical scheme is designed to enhance the intonation property from coarse to fine to improve the system accuracy. The high-level intonation features are then fused with the low-level embedding features. Experimental results on the official VoxCeleb1 test data, VoxCeleb1-E, and VoxCeleb-H protocols show that our method outperforms, and is more robust than, existing state-of-the-art systems. To facilitate further research, code is available at https://github.com/happy**/FDN

Submitted 7 October, 2021; v1 submitted 18 August, 2021; originally announced August 2021.

arXiv:2008.04542 [pdf]

An Intelligent Control Strategy for buck DC-DC Converter via Deep Reinforcement Learning

Authors: Chenggang Cui, Nan Yan, Chuanlin Zhang

Abstract: As a typical switching power supply, the DC-DC converter has been widely applied in DC microgrids. Due to the variation of renewable energy generation, the research and design of DC-DC converter control algorithms with outstanding dynamic characteristics has significant theoretical and practical application value. To mitigate the bus voltage stability issue in DC microgrids, an innovative intelligent control strategy for buck DC-DC converters with constant power loads (CPLs) via a deep reinforcement learning algorithm is constructed for the first time. In this article, a Markov Decision Process (MDP) model and the deep Q network (DQN) algorithm are defined for the DC-DC converter. A model-free deep reinforcement learning (DRL) control strategy is designed to adjust the agent-environment interaction through the rewards/penalties mechanism towards achieving convergence to the nominal voltage. The agent makes approximate decisions by extracting the high-dimensional features of complex power systems without any prior knowledge. Finally, the simulation comparison results demonstrate that the proposed controller has stronger self-learning and self-optimization capabilities under different scenarios.

Submitted 11 August, 2020; originally announced August 2020.

arXiv:2003.02314 [pdf, other]

The Impact of Hole Geometry on Relative Robustness of In-Painting Networks: An Empirical Study

Authors: Masood S. Mortazavi, Ning Yan

Abstract: In-painting networks use existing pixels to generate appropriate pixels to fill "holes" placed on parts of an image. A 2-D in-painting network's input usually consists of (1) a three-channel 2-D image, and (2) an additional channel for the "holes" to be in-painted in that image. In this paper, we study the robustness of a given in-painting neural network against variations in hole geometry distributions. We observe that the robustness of an in-painting network is dependent on the probability distribution function (PDF) of the hole geometry presented to it during its training even if the underlying image dataset used (in training and testing) does not alter. We develop an experimental methodology for testing and evaluating relative robustness of in-painting networks against four different kinds of hole geometry PDFs. We examine a number of hypotheses regarding (1) the natural bias of in-painting networks to the hole distribution used for their training, (2) the underlying dataset's ability to differentiate relative robustness as hole distributions vary in a train-test (cross-comparison) grid, and (3) the impact of the directional distribution of edges in the holes and in the image dataset. We present results for L1, PSNR and SSIM quality metrics and develop a specific measure of relative in-painting robustness to be used in cross-comparison grids based on these quality metrics. (One can incorporate other quality metrics in this relative measure.) The empirical work reported here is an initial step in a broader and deeper investigation of "filling the blank" neural networks' sensitivity, robustness and regularization with respect to hole "geometry" PDFs, and it suggests further research in this domain.

Submitted 4 March, 2020; originally announced March 2020.
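The abstract above reports results with the L1 and PSNR quality metrics. For orientation, the sketch below gives the textbook definitions of both on flattened pixel lists. The function names and the 8-bit default `max_value` are assumptions made for illustration, and the paper's own relative robustness measure is not reproduced here.

```python
import math


def l1_error(reference, output):
    """Mean absolute difference between two equally sized, non-empty pixel lists."""
    return sum(abs(r - o) for r, o in zip(reference, output)) / len(reference)


def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Two pixels that are each off by 10 grey levels on an 8-bit scale.
print(l1_error([100, 100], [110, 90]))
print(psnr([100, 100], [110, 90]))
```

Lower L1 and higher PSNR both indicate an output closer to the reference image.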

arXiv:1906.09884 [pdf, ps, other]

Channel-by-Channel Demosaicking Networks with Embedded Spectral Correlation

Authors: Niu Yan, Jihong Ouyang

Abstract: Demosaicking is typically the first step in today's Image Signal Processing (ISP) pipeline of digital cameras. It reconstructs image RGB values from the spatially and spectrally sparse Color Filter Array (CFA) samples, which are the original raw data digitized from electrical signals. High-quality and low-cost demosaicking is not only necessary for photography, but also fundamental for many machine vision tasks. This paper proposes an accurate and fast demosaicking model based on Convolutional Neural Networks (CNN) for the Bayer CFA, which is the most popular color filter arrangement adopted by digital camera manufacturers. Observing that each channel has different estimation complexity, we reconstruct each channel by an individual sub-network. Moreover, instead of directly estimating the red and blue values, our model infers the green-red and green-blue color differences. This strategy allows recovering the most complex channel by a lightweight network. Although the total size of our model is significantly smaller than the state-of-the-art demosaicking networks, it achieves substantially higher performance in both demosaicking quality and computational cost, as validated by extensive experiments. Source code will be released along with paper publication.

Submitted 22 April, 2020; v1 submitted 24 June, 2019; originally announced June 2019.

Showing 1–13 of 13 results for author: Yan, N