Search | arXiv e-print repository

Refining Self-Supervised Learnt Speech Representation using Brain Activations

Authors: Hengyu Li, Kangdi Mei, Zhaoci Liu, Yang Ai, Li** Chen, Jie Zhang, Zhenhua Ling

Abstract: It was shown in literature that speech representations extracted by self-supervised pre-trained models exhibit similarities with brain activations of human for speech perception and fine-tuning speech representation models on downstream tasks can further improve the similarity. However, it still remains unclear if this similarity can be used to optimize the pre-trained speech models. In this work,… ▽ More It was shown in literature that speech representations extracted by self-supervised pre-trained models exhibit similarities with brain activations of human for speech perception and fine-tuning speech representation models on downstream tasks can further improve the similarity. However, it still remains unclear if this similarity can be used to optimize the pre-trained speech models. In this work, we therefore propose to use the brain activations recorded by fMRI to refine the often-used wav2vec2.0 model by aligning model representations toward human neural responses. Experimental results on SUPERB reveal that this operation is beneficial for several downstream tasks, e.g., speaker verification, automatic speech recognition, intent classification.One can then consider the proposed method as a new alternative to improve self-supervised speech models. △ Less

Submitted 13 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

Comments: accpeted by Interspeech2024

arXiv:2406.02250 [pdf, other]

Multi-Stage Speech Bandwidth Extension with Flexible Sampling Rate Control

Authors: Ye-Xin Lu, Yang Ai, Zheng-Yan Sheng, Zhen-Hua Ling

Abstract: The majority of existing speech bandwidth extension (BWE) methods operate under the constraint of fixed source and target sampling rates, which limits their flexibility in practical applications. In this paper, we propose a multi-stage speech BWE model named MS-BWE, which can handle a set of source and target sampling rate pairs and achieve flexible extensions of frequency bandwidth. The proposed… ▽ More The majority of existing speech bandwidth extension (BWE) methods operate under the constraint of fixed source and target sampling rates, which limits their flexibility in practical applications. In this paper, we propose a multi-stage speech BWE model named MS-BWE, which can handle a set of source and target sampling rate pairs and achieve flexible extensions of frequency bandwidth. The proposed MS-BWE model comprises a cascade of BWE blocks, with each block featuring a dual-stream architecture to realize amplitude and phase extension, progressively painting the speech frequency bands stage by stage. The teacher-forcing strategy is employed to mitigate the discrepancy between training and inference. Experimental results demonstrate that our proposed MS-BWE is comparable to state-of-the-art speech BWE methods in speech quality. Regarding generation efficiency, the one-stage generation of MS-BWE can achieve over one thousand times real-time on GPU and about sixty times on CPU. △ Less

Submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.02162 [pdf, other]

BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation

Authors: Hui-Peng Du, Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

Abstract: This paper proposes a novel bidirectional neural vocoder, named BiVocoder, capable both of feature extraction and reverse waveform generation within the short-time Fourier transform (STFT) domain. For feature extraction, the BiVocoder takes amplitude and phase spectra derived from STFT as inputs, transforms them into long-frame-shift and low-dimensional features through convolutional neural networ… ▽ More This paper proposes a novel bidirectional neural vocoder, named BiVocoder, capable both of feature extraction and reverse waveform generation within the short-time Fourier transform (STFT) domain. For feature extraction, the BiVocoder takes amplitude and phase spectra derived from STFT as inputs, transforms them into long-frame-shift and low-dimensional features through convolutional neural networks. The extracted features are demonstrated suitable for direct prediction by acoustic models, supporting its application in text-to-speech (TTS) task. For waveform generation, the BiVocoder restores amplitude and phase spectra from the features by a symmetric network, followed by inverse STFT to reconstruct the speech waveform. Experimental results show that our proposed BiVocoder achieves better performance compared to some baseline vocoders, by comprehensively considering both synthesized speech quality and inference speed for both analysis-synthesis and TTS tasks. △ Less

Submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2404.08857 [pdf, other]

Voice Attribute Editing with Text Prompt

Authors: Zhengyan Sheng, Yang Ai, Li-Juan Liu, Jia Pan, Zhen-Hua Ling

Abstract: Despite recent advancements in speech generation with text prompt providing control over speech style, voice attributes in synthesized speech remain elusive and challenging to control. This paper introduces a novel task: voice attribute editing with text prompt, with the goal of making relative modifications to voice attributes according to the actions described in the text prompt. To solve this t… ▽ More Despite recent advancements in speech generation with text prompt providing control over speech style, voice attributes in synthesized speech remain elusive and challenging to control. This paper introduces a novel task: voice attribute editing with text prompt, with the goal of making relative modifications to voice attributes according to the actions described in the text prompt. To solve this task, VoxEditor, an end-to-end generative model, is proposed. In VoxEditor, addressing the insufficiency of text prompt, a Residual Memory (ResMem) block is designed, that efficiently maps voice attributes and these descriptors into the shared feature space. Additionally, the ResMem block is enhanced with a voice attribute degree prediction (VADP) block to align voice attributes with corresponding descriptors, addressing the imprecision of text prompt caused by non-quantitative descriptions of voice attributes. We also establish the open-source VCTK-RVA dataset, which leads the way in manual annotations detailing voice characteristic differences among different speakers. Extensive experiments demonstrate the effectiveness and generalizability of our proposed method in terms of both objective and subjective metrics. The dataset and audio samples are available on the website. △ Less

Submitted 12 April, 2024; originally announced April 2024.

arXiv:2403.17378 [pdf, other]

Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrap** Losses for Speech Generation Tasks

Authors: Yang Ai, Zhen-Hua Ling

Abstract: This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is a core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional la… ▽ More This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is a core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrap**, we design anti-wrap** training losses defined between the predicted wrapped phase spectra and natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error using an anti-wrap** function. We mathematically demonstrate that the anti-wrap** function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions and knowledge distillation training strategies. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms the iterative phase estimation algorithms and neural network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with HiFi-GAN-based waveform reconstruction method, our proposed model also shows outstanding efficiency advantages while ensuring the quality of synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from amplitude spectra only via neural networks. △ Less

Submitted 26 March, 2024; originally announced March 2024.

Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing. arXiv admin note: substantial text overlap with arXiv:2211.15974

arXiv:2402.10533 [pdf, other]

APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding

Authors: Yang Ai, Xiao-Hang Jiang, Ye-Xin Lu, Hui-Peng Du, Zhen-Hua Ling

Abstract: This paper introduces a novel neural audio codec targeting high waveform sampling rates and low bitrates named APCodec, which seamlessly integrates the strengths of parametric codecs and waveform codecs. The APCodec revolutionizes the process of audio encoding and decoding by concurrently handling the amplitude and phase spectra as audio parametric characteristics like parametric codecs. It is com… ▽ More This paper introduces a novel neural audio codec targeting high waveform sampling rates and low bitrates named APCodec, which seamlessly integrates the strengths of parametric codecs and waveform codecs. The APCodec revolutionizes the process of audio encoding and decoding by concurrently handling the amplitude and phase spectra as audio parametric characteristics like parametric codecs. It is composed of an encoder and a decoder with the modified ConvNeXt v2 network as the backbone, connected by a quantizer based on the residual vector quantization (RVQ) mechanism. The encoder compresses the audio amplitude and phase spectra in parallel, amalgamating them into a continuous latent code at a reduced temporal resolution. This code is subsequently quantized by the quantizer. Ultimately, the decoder reconstructs the audio amplitude and phase spectra in parallel, and the decoded waveform is obtained by inverse short-time Fourier transform. To ensure the fidelity of decoded audio like waveform codecs, spectral-level loss, quantization loss, and generative adversarial network (GAN) based loss are collectively employed for training the APCodec. To support low-latency streamable inference, we employ feed-forward layers and causal convolutional layers in APCodec, incorporating a knowledge distillation training strategy to enhance the quality of decoded audio. Experimental results confirm that our proposed APCodec can encode 48 kHz audio at bitrate of just 6 kbps, with no significant degradation in the quality of the decoded audio. At the same bitrate, our proposed APCodec also demonstrates superior decoded audio quality and faster generation speed compared to well-known codecs, such as SoundStream, Encodec, HiFi-Codec and AudioDec. △ Less

Submitted 16 February, 2024; originally announced February 2024.

Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

arXiv:2401.06387 [pdf, other]

Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction

Authors: Ye-Xin Lu, Yang Ai, Hui-Peng Du, Zhen-Hua Ling

Abstract: Speech bandwidth extension (BWE) refers to widening the frequency bandwidth range of speech signals, enhancing the speech quality towards brighter and fuller. This paper proposes a generative adversarial network (GAN) based BWE model with parallel prediction of Amplitude and Phase spectra, named AP-BWE, which achieves both high-quality and efficient wideband speech waveform generation. The propose… ▽ More Speech bandwidth extension (BWE) refers to widening the frequency bandwidth range of speech signals, enhancing the speech quality towards brighter and fuller. This paper proposes a generative adversarial network (GAN) based BWE model with parallel prediction of Amplitude and Phase spectra, named AP-BWE, which achieves both high-quality and efficient wideband speech waveform generation. The proposed AP-BWE generator is entirely based on convolutional neural networks (CNNs). It features a dual-stream architecture with mutual interaction, where the amplitude stream and the phase stream communicate with each other and respectively extend the high-frequency components from the input narrowband amplitude and phase spectra. To improve the naturalness of the extended speech signals, we employ a multi-period discriminator at the waveform level and design a pair of multi-resolution amplitude and phase discriminators at the spectral level, respectively. Experimental results demonstrate that our proposed AP-BWE achieves state-of-the-art performance in terms of speech quality for BWE tasks targeting sampling rates of both 16 kHz and 48 kHz. In terms of generation efficiency, due to the all-convolutional architecture and all-frame-level operations, the proposed AP-BWE can generate 48 kHz waveform samples 292.3 times faster than real-time on a single RTX 4090 GPU and 18.1 times faster than real-time on a single CPU. Notably, to our knowledge, AP-BWE is the first to achieve the direct extension of the high-frequency phase spectrum, which is beneficial for improving the effectiveness of existing BWE methods. △ Less

Submitted 12 January, 2024; originally announced January 2024.

Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

arXiv:2311.11545 [pdf, other]

APNet2: High-quality and High-efficiency Neural Vocoder with Direct Prediction of Amplitude and Phase Spectra

Authors: Hui-Peng Du, Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

Abstract: In our previous work, we proposed a neural vocoder called APNet, which directly predicts speech amplitude and phase spectra with a 5 ms frame shift in parallel from the input acoustic features, and then reconstructs the 16 kHz speech waveform using inverse short-time Fourier transform (ISTFT). APNet demonstrates the capability to generate synthesized speech of comparable quality to the HiFi-GAN vo… ▽ More In our previous work, we proposed a neural vocoder called APNet, which directly predicts speech amplitude and phase spectra with a 5 ms frame shift in parallel from the input acoustic features, and then reconstructs the 16 kHz speech waveform using inverse short-time Fourier transform (ISTFT). APNet demonstrates the capability to generate synthesized speech of comparable quality to the HiFi-GAN vocoder but with a considerably improved inference speed. However, the performance of the APNet vocoder is constrained by the waveform sampling rate and spectral frame shift, limiting its practicality for high-quality speech synthesis. Therefore, this paper proposes an improved iteration of APNet, named APNet2. The proposed APNet2 vocoder adopts ConvNeXt v2 as the backbone network for amplitude and phase predictions, expecting to enhance the modeling capability. Additionally, we introduce a multi-resolution discriminator (MRD) into the GAN-based losses and optimize the form of certain losses. At a common configuration with a waveform sampling rate of 22.05 kHz and spectral frame shift of 256 points (i.e., approximately 11.6ms), our proposed APNet2 vocoder outperformed the original APNet and Vocos vocoders in terms of synthesized speech quality. The synthesized speech quality of APNet2 is also comparable to that of HiFi-GAN and iSTFTNet, while offering a significantly faster inference speed. △ Less

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2309.10455 [pdf, other]

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement

Authors: Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Abstract: Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes the incorporation of ultrasound tongue images to improve the performance of lip-based AV-SE systems further. To address the challenge of acquiring ultrasound tongue images duri… ▽ More Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes the incorporation of ultrasound tongue images to improve the performance of lip-based AV-SE systems further. To address the challenge of acquiring ultrasound tongue images during inference, we first propose to employ knowledge distillation during training to investigate the feasibility of leveraging tongue-related information without directly inputting ultrasound tongue images. Specifically, we guide an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model, thus transferring tongue-related knowledge. To better model the alignment between the lip and tongue modalities, we further propose the introduction of a lip-tongue key-value memory network into the AV-SE model. This network enables the retrieval of tongue features based on readily available lip features, thereby assisting the subsequent speech enhancement task. Experimental results demonstrate that both methods significantly improve the quality and intelligibility of the enhanced speech compared to traditional lip-based AV-SE baselines. Moreover, both proposed methods exhibit strong generalization performance on unseen speakers and in the presence of unseen noises. Furthermore, phone error rate (PER) analysis of automatic speech recognition (ASR) reveals that while all phonemes benefit from introducing ultrasound tongue images, palatal and velar consonants benefit most. △ Less

Submitted 20 November, 2023; v1 submitted 19 September, 2023; originally announced September 2023.

Comments: Submmited to IEEE/ACM Transactions on Audio, Speech and Language Processing. arXiv admin note: text overlap with arXiv:2305.14933

arXiv:2309.09470 [pdf, other]

Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

Authors: Zheng-Yan Sheng, Yang Ai, Yan-Nian Chen, Zhen-Hua Ling

Abstract: This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC), which aims at converting the voice characteristics of an utterance from any source speaker to a newly coming target speaker, solely relying on a single face image of the target speaker. To address this task, we propose a face-voice memory-based zero-shot FaceVC method. This method leverages a memo… ▽ More This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC), which aims at converting the voice characteristics of an utterance from any source speaker to a newly coming target speaker, solely relying on a single face image of the target speaker. To address this task, we propose a face-voice memory-based zero-shot FaceVC method. This method leverages a memory-based face-voice alignment module, in which slots act as the bridge to align these two modalities, allowing for the capture of voice characteristics from face images. A mixed supervision strategy is also introduced to mitigate the long-standing issue of the inconsistency between training and inference phases for voice conversion tasks. To obtain speaker-independent content-related representations, we transfer the knowledge from a pretrained zero-shot voice conversion model to our zero-shot FaceVC model. Considering the differences between FaceVC and traditional voice conversion tasks, systematic subjective and objective metrics are designed to thoroughly evaluate the homogeneity, diversity and consistency of voice characteristics controlled by face images. Through extensive experiments, we demonstrate the superiority of our proposed method on the zero-shot FaceVC task. Samples are presented on our demo website. △ Less

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2308.08926 [pdf, other]

Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

Authors: Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

Abstract: Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrap** characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome the above issue, in this paper, we proposed MP-SENet, a novel Speech En… ▽ More Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrap** characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome the above issue, in this paper, we proposed MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet comprises a Transformer-embedded encoder-decoder architecture. The encoder aims to encode the input distorted magnitude and phase spectra into time-frequency representations, which are further fed into time-frequency Transformers for alternatively capturing time and frequency dependencies. The decoder comprises a magnitude mask decoder and a phase decoder, directly enhancing magnitude and wrapped phase spectra by incorporating a magnitude masking architecture and a phase parallel estimation architecture, respectively. Multi-level loss functions explicitly defined on the magnitude spectra, wrapped phase spectra, and short-time complex spectra are adopted to jointly train the MP-SENet model. A metric discriminator is further employed to compensate for the incomplete correlation between these losses and human auditory perception. Experimental results demonstrate that our proposed MP-SENet achieves state-of-the-art performance across multiple speech enhancement tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it further mitigates the compensation effect between the magnitude and phase by explicit phase estimation, elevating the perceptual quality of enhanced speech. △ Less

Submitted 1 April, 2024; v1 submitted 17 August, 2023; originally announced August 2023.

Comments: Submmited to IEEE Transactions on Audio, Speech and Language Processing

arXiv:2308.08850 [pdf, other]

Long-frame-shift Neural Speech Phase Prediction with Spectral Continuity Enhancement and Interpolation Error Compensation

Authors: Yang Ai, Ye-Xin Lu, Zhen-Hua Ling

Abstract: Speech phase prediction, which is a significant research focus in the field of signal processing, aims to recover speech phase spectra from amplitude-related features. However, existing speech phase prediction methods are constrained to recovering phase spectra with short frame shifts, which are considerably smaller than the theoretical upper bound required for exact waveform reconstruction of sho… ▽ More Speech phase prediction, which is a significant research focus in the field of signal processing, aims to recover speech phase spectra from amplitude-related features. However, existing speech phase prediction methods are constrained to recovering phase spectra with short frame shifts, which are considerably smaller than the theoretical upper bound required for exact waveform reconstruction of short-time Fourier transform (STFT). To tackle this issue, we present a novel long-frame-shift neural speech phase prediction (LFS-NSPP) method which enables precise prediction of long-frame-shift phase spectra from long-frame-shift log amplitude spectra. The proposed method consists of three stages: interpolation, prediction and decimation. The short-frame-shift log amplitude spectra are first constructed from long-frame-shift ones through frequency-by-frequency interpolation to enhance the spectral continuity, and then employed to predict short-frame-shift phase spectra using an NSPP model, thereby compensating for interpolation errors. Ultimately, the long-frame-shift phase spectra are obtained from short-frame-shift ones through frame-by-frame decimation. Experimental results show that the proposed LFS-NSPP method can yield superior quality in predicting long-frame-shift phase spectra than the original NSPP model and other signal-processing-based phase estimation algorithms. △ Less

Submitted 17 August, 2023; originally announced August 2023.

Comments: Published at IEEE Signal Processing Letters

arXiv:2305.14933 [pdf, other]

doi 10.21437/Interspeech.2023-780

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation

Authors: Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Abstract: Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes further incorporating ultrasound tongue images to improve lip-based AV-SE systems' performance. Knowledge distillation is employed at the training stage to address the challeng… ▽ More Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes further incorporating ultrasound tongue images to improve lip-based AV-SE systems' performance. Knowledge distillation is employed at the training stage to address the challenge of acquiring ultrasound tongue images during inference, enabling an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model. Experimental results demonstrate significant improvements in the quality and intelligibility of the speech enhanced by the proposed method compared to the traditional audio-lip speech enhancement baselines. Further analysis using phone error rates (PER) of automatic speech recognition (ASR) shows that palatal and velar consonants benefit most from the introduction of ultrasound tongue images. △ Less

Submitted 20 November, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: Published in InterSpeech 2023

Journal ref: Proc. INTERSPEECH 2023, 844-848 (2023)

arXiv:2305.14359 [pdf, other]

Zero-shot personalized lip-to-speech synthesis with face image based voice control

Authors: Zheng-Yan Sheng, Yang Ai, Zhen-Hua Ling

Abstract: Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies in a series of independent studies. However, existing studies can not achieve voice control under zero-shot condition, because extra speaker embeddings need to be extracted from natural reference speech and are unavailabl… ▽ More Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies in a series of independent studies. However, existing studies can not achieve voice control under zero-shot condition, because extra speaker embeddings need to be extracted from natural reference speech and are unavailable when only the silent video of an unseen speaker is given. In this paper, we propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities. A variational autoencoder is adopted to disentangle the speaker identity and linguistic content representations, which enables speaker embeddings to control the voice characteristics of synthetic speech for unseen speakers. Furthermore, we propose associated cross-modal representation learning to promote the ability of face-based speaker embeddings (FSE) on voice control. Extensive experiments verify the effectiveness of the proposed method whose synthetic utterances are more natural and matching with the personality of input video than the compared methods. To our best knowledge, this paper makes the first attempt on zero-shot personalized Lip2Speech synthesis with a face image rather than reference audio to control voice characteristics. △ Less

Submitted 8 May, 2023; originally announced May 2023.

Comments: ICASSP 2023

arXiv:2305.13686 [pdf, other]

doi 10.21437/Interspeech.2023-1441

MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra

Authors: Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

Abstract: This paper proposes MP-SENet, a novel Speech Enhancement Network which directly denoises Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec architecture in which the encoder and decoder are bridged by convolution-augmented transformers. The encoder aims to encode time-frequency representations from the input noisy magnitude and phase spectra. The decoder is composed of p… ▽ More This paper proposes MP-SENet, a novel Speech Enhancement Network which directly denoises Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec architecture in which the encoder and decoder are bridged by convolution-augmented transformers. The encoder aims to encode time-frequency representations from the input noisy magnitude and phase spectra. The decoder is composed of parallel magnitude mask decoder and phase decoder, directly recovering clean magnitude spectra and clean-wrapped phase spectra by incorporating learnable sigmoid activation and parallel phase estimation architecture, respectively. Multi-level losses defined on magnitude spectra, phase spectra, short-time complex spectra, and time-domain waveforms are used to train the MP-SENet model jointly. Experimental results show that our proposed MP-SENet achieves a PESQ of 3.50 on the public VoiceBank+DEMAND dataset and outperforms existing advanced speech enhancement methods. △ Less

Submitted 23 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech 2023

arXiv:2305.07952 [pdf, other]

APNet: An All-Frame-Level Neural Vocoder Incorporating Direct Prediction of Amplitude and Phase Spectra

Authors: Yang Ai, Zhen-Hua Ling

Abstract: This paper presents a novel neural vocoder named APNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra directly. The APNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a residual convolution network which predicts frame-level log amplitude spectra from acoustic features. The PSP al… ▽ More This paper presents a novel neural vocoder named APNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra directly. The APNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a residual convolution network which predicts frame-level log amplitude spectra from acoustic features. The PSP also adopts a residual convolution network using acoustic features as input, then passes the output of this network through two parallel linear convolution layers respectively, and finally integrates into a phase calculation formula to estimate frame-level phase spectra. Finally, the outputs of ASP and PSP are combined to reconstruct speech waveforms by inverse short-time Fourier transform (ISTFT). All operations of the ASP and PSP are performed at the frame level. We train the ASP and PSP jointly and define multilevel loss functions based on amplitude mean square error, phase anti-wrap** error, short-time spectral inconsistency error and time domain reconstruction error. Experimental results show that our proposed APNet vocoder achieves an approximately 8x faster inference speed than HiFi-GAN v1 on a CPU due to the all-frame-level operations, while its synthesized speech quality is comparable to HiFi-GAN v1. The synthesized speech quality of the APNet vocoder is also better than that of several equally efficient models. Ablation experiments also confirm that the proposed parallel phase estimation architecture is essential to phase modeling and the proposed loss functions are helpful for improving the synthesized speech quality. △ Less

Submitted 13 May, 2023; originally announced May 2023.

Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing. Codes are available

arXiv:2304.13270 [pdf, other]

doi 10.1007/978-981-99-2401-1_6

Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis

Authors: Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

Abstract: This paper proposes a source-filter-based generative adversarial neural vocoder named SF-GAN, which achieves high-fidelity waveform generation from input acoustic features by introducing F0-based source excitation signals to a neural filter framework. The SF-GAN vocoder is composed of a source module and a resolution-wise conditional filter module and is trained based on generative adversarial str… ▽ More This paper proposes a source-filter-based generative adversarial neural vocoder named SF-GAN, which achieves high-fidelity waveform generation from input acoustic features by introducing F0-based source excitation signals to a neural filter framework. The SF-GAN vocoder is composed of a source module and a resolution-wise conditional filter module and is trained based on generative adversarial strategies. The source module produces an excitation signal from the F0 information, then the resolution-wise convolutional filter module combines the excitation signal with processed acoustic features at various temporal resolutions and finally reconstructs the raw waveform. The experimental results show that our proposed SF-GAN vocoder outperforms the state-of-the-art HiFi-GAN and Fre-GAN in both analysis-synthesis (AS) and text-to-speech (TTS) tasks, and the synthesized speech quality of SF-GAN is comparable to the ground-truth audio. △ Less

Submitted 25 April, 2023; originally announced April 2023.

Comments: Accepted by NCMMSC 2022

Journal ref: Man-Machine Speech Communication, 2022, pp.68-80

arXiv:2304.11106 [pdf, other]

A Convolutional Spiking Network for Gesture Recognition in Brain-Computer Interfaces

Authors: Yiming Ai, Bipin Rajendran

Abstract: Brain-computer interfaces are being explored for a wide variety of therapeutic applications. Typically, this involves measuring and analyzing continuous-time electrical brain activity via techniques such as electrocorticogram (ECoG) or electroencephalography (EEG) to drive external devices. However, due to the inherent noise and variability in the measurements, the analysis of these signals is cha… ▽ More Brain-computer interfaces are being explored for a wide variety of therapeutic applications. Typically, this involves measuring and analyzing continuous-time electrical brain activity via techniques such as electrocorticogram (ECoG) or electroencephalography (EEG) to drive external devices. However, due to the inherent noise and variability in the measurements, the analysis of these signals is challenging and requires offline processing with significant computational resources. In this paper, we propose a simple yet efficient machine learning-based approach for the exemplary problem of hand gesture classification based on brain signals. We use a hybrid machine learning approach that uses a convolutional spiking neural network employing a bio-inspired event-driven synaptic plasticity rule for unsupervised feature learning of the measured analog signals encoded in the spike domain. We demonstrate that this approach generalizes to different subjects with both EEG and ECoG data and achieves superior accuracy in the range of 92.74-97.07% in identifying different hand gesture classes and motor imagery tasks. △ Less

Submitted 27 April, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

Comments: Accepted at AICAS 2023

arXiv:2304.05574 [pdf, other]

Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training

Authors: Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Abstract: This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing sound. This task falls under the umbrella of articulatory-to-acoustic conversion, and may also be refered to as a silent speech interface. We propose to employ a method… ▽ More This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing sound. This task falls under the umbrella of articulatory-to-acoustic conversion, and may also be refered to as a silent speech interface. We propose to employ a method built on pseudo target generation and domain adversarial training with an iterative training strategy to improve the intelligibility and naturalness of the speech recovered from silent tongue and lip articulation. Experiments show that our proposed method significantly improves the intelligibility and naturalness of the reconstructed speech in silent speaking mode compared to the baseline TaLNet model. When using an automatic speech recognition (ASR) model to measure intelligibility, the word error rate (WER) of our proposed method decreases by over 15% compared to the baseline. In addition, our proposed method also outperforms the baseline on the intelligibility of the speech reconstructed in vocalized articulating mode, reducing the WER by approximately 10%. △ Less

Submitted 11 April, 2023; originally announced April 2023.

Comments: To be published in ICASSP2023

arXiv:2211.15974 [pdf, other]

Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrap** Losses

Authors: Yang Ai, Zhen-Hua Ling

Abstract: This paper presents a novel speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra by neural networks. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is composed of two parallel linear convolutional layers and a phase calculation formula, imitating the proce… ▽ More This paper presents a novel speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra by neural networks. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is composed of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrap**, we design anti-wrap** training losses defined between the predicted wrapped phase spectra and natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error using an anti-wrap** function. Experimental results show that our proposed neural speech phase prediction model outperforms the iterative Griffin-Lim algorithm and other neural network-based method, in terms of both reconstructed speech quality and generation speed. △ Less

Submitted 16 February, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: Accepted by ICASSP 2023. Codes are available

arXiv:2111.09413 [pdf, other]

Mixed Dual-Hop IRS-Assisted FSO-RF Communication System with H-ARQ Protocols

Authors: Gyan Deep Verma, Aashish Mathur, Yun Ai, Michael Cheffena

Abstract: Intelligent reflecting surface (IRS) is an emerging key technology for the fifth-generation (5G) and beyond wireless communication systems to provide more robust and reliable communication links. In this paper, we propose a mixed dual-hop free-space optical (FSO)-radio frequency (RF) communication system that serves the end user via a decode-and-forward (DF) relay employing hybrid automatic repeat… ▽ More Intelligent reflecting surface (IRS) is an emerging key technology for the fifth-generation (5G) and beyond wireless communication systems to provide more robust and reliable communication links. In this paper, we propose a mixed dual-hop free-space optical (FSO)-radio frequency (RF) communication system that serves the end user via a decode-and-forward (DF) relay employing hybrid automatic repeat request (HARQ) protocols on both hops. Novel closed-form expressions of the probability density function (PDF) and cumulative density function (CDF) of the equivalent end-to-end signal-to-noise ratio (SNR) are computed for the considered system. Utilizing the obtained statistics functions, we derive the outage probability (OP) and packet error rate (PER) of the proposed system by considering generalized detection techniques on the source-to-relay (S-R) link with H-ARQ protocol and IRS having phase error. We obtain useful insights into the system performance through the asymptotic analysis which aids to compute the diversity gain. The derived analytical results are validated using Monte Carlo simulation. △ Less

Submitted 20 August, 2021; originally announced November 2021.

Comments: 5 pages, 6 figures

arXiv:2103.11657 [pdf, other]

Communication Technologies for Smart Grid: A Comprehensive Survey

Authors: Fredrik Ege Abrahamsen, Yun Ai, Michael Cheffena

Abstract: With the ongoing trends in the energy sector such as vehicular electrification and renewable energy, smart grid is clearly playing a more and more important role in the electric power system industry. One essential feature of the smart grid is the information flow over the high-speed, reliable and secure data communication network in order to manage the complex power systems effectively and intell… ▽ More With the ongoing trends in the energy sector such as vehicular electrification and renewable energy, smart grid is clearly playing a more and more important role in the electric power system industry. One essential feature of the smart grid is the information flow over the high-speed, reliable and secure data communication network in order to manage the complex power systems effectively and intelligently. Smart grids utilize bidirectional communication to function where traditional power grids mainly only use one-way communication. The communication requirements and suitable technique differ depending on the specific environment and scenario. In this paper, we provide a comprehensive and up-to-date survey on the communication technologies used in the smart grid, including the communication requirements, physical layer technologies, network architectures, and research challenges. This survey aims to help the readers identify the potential research problems in the continued research on the topic of smart grid communications. △ Less

Submitted 7 June, 2021; v1 submitted 22 March, 2021; originally announced March 2021.

arXiv:2103.07163 [pdf, other]

doi 10.1109/TVT.2021.3066219

Secure Outage Analysis of FSO Communications Over Arbitrarily Correlated Málaga Turbulence Channels

Authors: Yun Ai, Aashish Mathur, Long Kong, Michael Cheffena

Abstract: In this paper, we analyze the secrecy outage performance for more realistic eavesdrop** scenario of free-space optical (FSO) communications, where the main and wiretap links are correlated. The FSO fading channels are modeled by the well-known Málaga distribution. Exact expressions for the secrecy performance metrics such as secrecy outage probability (SOP) and probability of the non zero secrec… ▽ More In this paper, we analyze the secrecy outage performance for more realistic eavesdrop** scenario of free-space optical (FSO) communications, where the main and wiretap links are correlated. The FSO fading channels are modeled by the well-known Málaga distribution. Exact expressions for the secrecy performance metrics such as secrecy outage probability (SOP) and probability of the non zero secrecy capacity (PNZSC) are derived, and asymptotic analysis on the SOP is also conducted. The obtained results reveal useful insights on the effect of channel correlation on FSO communications. Counterintuitively, it is found that the secrecy outage performance demonstrates a non-monotonic behavior with the increase of correlation. More specifically, there is an SNR penalty for achieving a target SOP as the correlation increases within some range. However, when the correlation is further increased beyond some threshold, the SOP performance improves significantly. △ Less

Submitted 12 March, 2021; originally announced March 2021.

Comments: 6 pages, 5 figures

arXiv:2011.14899 [pdf, ps, other]

Secure Vehicular Communications through Reconfigurable Intelligent Surfaces

Authors: Yun Ai, Felipe A. P. de Figueiredo, Long Kong, Michael Cheffena, Symeon Chatzinotas, Björn Ottersten

Abstract: Reconfigurable intelligent surfaces (RIS) is considered as a revolutionary technique to improve the wireless system performance by reconfiguring the radio wave propagation environment artificially. Motivated by the potential of RIS in vehicular networks, we analyze the secrecy outage performance of RIS-aided vehicular communications in this paper. More specifically, two vehicular communication sce… ▽ More Reconfigurable intelligent surfaces (RIS) is considered as a revolutionary technique to improve the wireless system performance by reconfiguring the radio wave propagation environment artificially. Motivated by the potential of RIS in vehicular networks, we analyze the secrecy outage performance of RIS-aided vehicular communications in this paper. More specifically, two vehicular communication scenarios are considered, i.e., a vehicular-to-vehicular (V2V) communication where the RIS acts as a relay and a vehicular-to-infrastructure (V2I) scenario where the RIS functions as the receiver. In both scenarios, a passive eavesdropper is present attempting to retrieve the transmitted information. Closed-form expressions for the secrecy outage probability (SOP) are derived and verified. The results demonstrate the potential of improving secrecy with the aid of RIS under both V2V and V2I communications. △ Less

Submitted 2 December, 2020; v1 submitted 30 November, 2020; originally announced November 2020.

Comments: 6 pages

arXiv:2011.05038 [pdf, other]

Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model

Authors: Haoyu Li, Yang Ai, Junichi Yamagishi

Abstract: High-quality speech corpora are essential foundations for most speech applications. However, such speech data are expensive and limited since they are collected in professional recording environments. In this work, we propose an encoder-decoder neural network to automatically enhance low-quality recordings to professional high-quality recordings. To address channel variability, we first filter out… ▽ More High-quality speech corpora are essential foundations for most speech applications. However, such speech data are expensive and limited since they are collected in professional recording environments. In this work, we propose an encoder-decoder neural network to automatically enhance low-quality recordings to professional high-quality recordings. To address channel variability, we first filter out the channel characteristics from the original input audio using the encoder network with adversarial training. Next, we disentangle the channel factor from a reference audio. Conditioned on this factor, an auto-regressive decoder is then used to predict the target-environment Mel spectrogram. Finally, we apply a neural vocoder to synthesize the speech waveform. Experimental results show that the proposed system can generate a professional high-quality speech waveform when setting high-quality audio as the reference. It also improves speech enhancement performance compared with several state-of-the-art baseline systems. △ Less

Submitted 10 November, 2020; originally announced November 2020.

Comments: 8 pages. Accepted to IEEE SLT 2021

arXiv:2011.03955 [pdf, other]

Denoising-and-Dereverberation Hierarchical Neural Vocoder for Robust Waveform Generation

Authors: Yang Ai, Haoyu Li, Xin Wang, Junichi Yamagishi, Zhenhua Ling

Abstract: This paper presents a denoising and dereverberation hierarchical neural vocoder (DNR-HiNet) to convert noisy and reverberant acoustic features into a clean speech waveform. We implement it mainly by modifying the amplitude spectrum predictor (ASP) in the original HiNet vocoder. This modified denoising and dereverberation ASP (DNR-ASP) can predict clean log amplitude spectra (LAS) from input degrad… ▽ More This paper presents a denoising and dereverberation hierarchical neural vocoder (DNR-HiNet) to convert noisy and reverberant acoustic features into a clean speech waveform. We implement it mainly by modifying the amplitude spectrum predictor (ASP) in the original HiNet vocoder. This modified denoising and dereverberation ASP (DNR-ASP) can predict clean log amplitude spectra (LAS) from input degraded acoustic features. To achieve this, the DNR-ASP first predicts the noisy and reverberant LAS, noise LAS related to the noise information, and room impulse response related to the reverberation information then performs initial denoising and dereverberation. The initial processed LAS are then enhanced by another neural network as the final clean LAS. To further improve the quality of the generated clean LAS, we also introduce a bandwidth extension model and frequency resolution extension model in the DNR-ASP. The experimental results indicate that the DNR-HiNet vocoder was able to generate a denoised and dereverberated waveform given noisy and reverberant acoustic features and outperformed the original HiNet vocoder and a few other neural vocoders. We also applied the DNR-HiNet vocoder to speech enhancement tasks, and its performance was competitive with several advanced speech enhancement methods. △ Less

Submitted 8 November, 2020; originally announced November 2020.

Comments: Accepted by SLT 2021

arXiv:2008.06182 [pdf, other]

Online Speaker Adaptation for WaveNet-based Neural Vocoders

Authors: Qiuchen Huang, Yang Ai, Zhenhua Ling

Abstract: In this paper, we propose an online speaker adaptation method for WaveNet-based neural vocoders in order to improve their performance on speaker-independent waveform generation. In this method, a speaker encoder is first constructed using a large speaker-verification dataset which can extract a speaker embedding vector from an utterance pronounced by an arbitrary speaker. At the training stage, a… ▽ More In this paper, we propose an online speaker adaptation method for WaveNet-based neural vocoders in order to improve their performance on speaker-independent waveform generation. In this method, a speaker encoder is first constructed using a large speaker-verification dataset which can extract a speaker embedding vector from an utterance pronounced by an arbitrary speaker. At the training stage, a speaker-aware WaveNet vocoder is then built using a multi-speaker dataset which adopts both acoustic feature sequences and speaker embedding vectors as conditions.At the generation stage, we first feed the acoustic feature sequence from a test speaker into the speaker encoder to obtain the speaker embedding vector of the utterance. Then, both the speaker embedding vector and acoustic features pass the speaker-aware WaveNet vocoder to reconstruct speech waveforms. Experimental results demonstrate that our method can achieve a better objective and subjective performance on reconstructing waveforms of unseen speakers than the conventional speaker-independent WaveNet vocoder. △ Less

Submitted 13 August, 2020; originally announced August 2020.

Comments: 6 pages, 2 figures, 4 tables

arXiv:2007.10786 [pdf]

Comparison of Different Methods for Time Sequence Prediction in Autonomous Vehicles

Authors: Teng Liu, Bin Tian, Yunfeng Ai, Long Chen, Fei Liu, Dongpu Cao

Abstract: As a combination of various kinds of technologies, autonomous vehicles could complete a series of driving tasks by itself, such as perception, decision-making, planning, and control. Since there is no human driver to handle the emergency situation, future transportation information is significant for automated vehicles. This paper proposes different methods to forecast the time series for autonomo… ▽ More As a combination of various kinds of technologies, autonomous vehicles could complete a series of driving tasks by itself, such as perception, decision-making, planning, and control. Since there is no human driver to handle the emergency situation, future transportation information is significant for automated vehicles. This paper proposes different methods to forecast the time series for autonomous vehicles, which are the nearest neighborhood (NN), fuzzy coding (FC), and long short term memory (LSTM). First, the formulation and operational process for these three approaches are introduced. Then, the vehicle velocity is regarded as a case study and the real-world dataset is utilized to predict future information via these techniques. Finally, the performance, merits, and drawbacks of the presented methods are analyzed and discussed. △ Less

Submitted 16 July, 2020; originally announced July 2020.

Comments: 6 pages, 11 figures

arXiv:2005.07379 [pdf, other]

Reverberation Modeling for Source-Filter-based Neural Vocoder

Authors: Yang Ai, Xin Wang, Junichi Yamagishi, Zhen-Hua Ling

Abstract: This paper presents a reverberation module for source-filter-based neural vocoders that improves the performance of reverberant effect modeling. This module uses the output waveform of neural vocoders as an input and produces a reverberant waveform by convolving the input with a room impulse response (RIR). We propose two approaches to parameterizing and estimating the RIR. The first approach assu… ▽ More This paper presents a reverberation module for source-filter-based neural vocoders that improves the performance of reverberant effect modeling. This module uses the output waveform of neural vocoders as an input and produces a reverberant waveform by convolving the input with a room impulse response (RIR). We propose two approaches to parameterizing and estimating the RIR. The first approach assumes a global time-invariant (GTI) RIR and directly learns the values of the RIR on a training dataset. The second approach assumes an utterance-level time-variant (UTV) RIR, which is invariant within one utterance but varies across utterances, and uses another neural network to predict the RIR values. We add the proposed reverberation module to the phase spectrum predictor (PSP) of a HiNet vocoder and jointly train the model. Experimental results demonstrate that the proposed module was helpful for modeling the reverberation effect and improving the perceived quality of generated reverberant speech. The UTV-RIR was shown to be more robust than the GTI-RIR to unknown reverberation conditions and achieved a perceptually better reverberation effect. △ Less

Submitted 15 May, 2020; originally announced May 2020.

Comments: Submitted to Interspeech 2020

arXiv:2004.13761 [pdf]

A Method for Vehicle Collision Risk Assessment through Inferring Driver's Braking Actions in Near-Crash Situations

Authors: Liqun Peng, Miguel Angel Sotelo, Yi He, Yunfei Ai, Zhixiong Li

Abstract: Driving information and data under potential vehicle crashes create opportunities for extensive real-world observations of driver behaviors and relevant factors that significantly influence the driving safety in emergency scenarios. Furthermore, the availability of such data also enhances the collision avoidance systems (CASs) by evaluating driver's actions in near-crash scenarios and providing ti… ▽ More Driving information and data under potential vehicle crashes create opportunities for extensive real-world observations of driver behaviors and relevant factors that significantly influence the driving safety in emergency scenarios. Furthermore, the availability of such data also enhances the collision avoidance systems (CASs) by evaluating driver's actions in near-crash scenarios and providing timely warnings. These applications motivate the need for heuristic tools capable of inferring relationship among driving risk, driver/vehicle characteristics, and road environment. In this paper, we acquired amount of real-world driving data and built a comprehensive dataset, which contains multiple "driver-vehicle-road" attributes. The proposed method works in two steps. In the first step, a variable precision rough set (VPRS) based classification technique is applied to draw a reduced core subset from field driving dataset, which presents the essential attributes set most relevant to driving safety assessment. In the second step, we design a decision strategy by introducing mutual information entropy to quantify the significance of each attribute, then a representative index through accumulation of weighted "driver-vehicle-road" factors is calculated to reflect the driving risk for actual situation. The performance of the proposed method is demonstrated in an offline analysis of the driving data collected in field trials, where the aim is to infer the emergency braking actions in next short term. The results indicate that our proposed model is a good alternative for providing improved warnings in real-time because of its high prediction accuracy and stability. △ Less

Submitted 28 April, 2020; originally announced April 2020.

Comments: 14 pages

arXiv:2004.07832 [pdf, other]

Knowledge-and-Data-Driven Amplitude Spectrum Prediction for Hierarchical Neural Vocoders

Authors: Yang Ai, Zhen-Hua Ling

Abstract: In our previous work, we have proposed a neural vocoder called HiNet which recovers speech waveforms by predicting amplitude and phase spectra hierarchically from input acoustic features. In HiNet, the amplitude spectrum predictor (ASP) predicts log amplitude spectra (LAS) from input acoustic features. This paper proposes a novel knowledge-and-data-driven ASP (KDD-ASP) to improve the conventional… ▽ More In our previous work, we have proposed a neural vocoder called HiNet which recovers speech waveforms by predicting amplitude and phase spectra hierarchically from input acoustic features. In HiNet, the amplitude spectrum predictor (ASP) predicts log amplitude spectra (LAS) from input acoustic features. This paper proposes a novel knowledge-and-data-driven ASP (KDD-ASP) to improve the conventional one. First, acoustic features (i.e., F0 and mel-cepstra) pass through a knowledge-driven LAS recovery module to obtain approximate LAS (ALAS). This module is designed based on the combination of STFT and source-filter theory, in which the source part and the filter part are designed based on input F0 and mel-cepstra, respectively. Then, the recovered ALAS are processed by a data-driven LAS refinement module which consists of multiple trainable convolutional layers to get the final LAS. Experimental results show that the HiNet vocoder using KDD-ASP can achieve higher quality of synthetic speech than that using conventional ASP and the WaveRNN vocoder on a text-to-speech (TTS) task. △ Less

Submitted 18 May, 2020; v1 submitted 16 April, 2020; originally announced April 2020.

Comments: Submitted to Interspeech 2020

arXiv:1906.09573 [pdf, other]

doi 10.1109/TASLP.2020.2970241

A Neural Vocoder with Hierarchical Generation of Amplitude and Phase Spectra for Statistical Parametric Speech Synthesis

Authors: Yang Ai, Zhen-Hua Ling

Abstract: This paper presents a neural vocoder named HiNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra hierarchically. Different from existing neural vocoders such as WaveNet, SampleRNN and WaveRNN which directly generate waveform samples using single neural networks, the HiNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase s… ▽ More This paper presents a neural vocoder named HiNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra hierarchically. Different from existing neural vocoders such as WaveNet, SampleRNN and WaveRNN which directly generate waveform samples using single neural networks, the HiNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a simple DNN model which predicts log amplitude spectra (LAS) from acoustic features. The predicted LAS are sent into the PSP for phase recovery. Considering the issue of phase war** and the difficulty of phase modeling, the PSP is constructed by concatenating a neural source-filter (NSF) waveform generator with a phase extractor. We also introduce generative adversarial networks (GANs) into both ASP and PSP. Finally, the outputs of ASP and PSP are combined to reconstruct speech waveforms by short-time Fourier synthesis. Since there are no autoregressive structures in both predictors, the HiNet vocoder can generate speech waveforms with high efficiency. Objective and subjective experimental results show that our proposed HiNet vocoder achieves better naturalness of reconstructed speech than the conventional STRAIGHT vocoder, a 16-bit WaveNet vocoder using open source implementation and an NSF vocoder with similar complexity to the PSP and obtains similar performance with a 16-bit WaveRNN vocoder. We also find that the performance of HiNet is insensitive to the complexity of the neural waveform generator in PSP to some extend. After simplifying its model structure, the time consumed for generating 1s waveforms of 16kHz speech using a GPU can be further reduced from 0.34s to 0.19s without significant quality degradation. △ Less

Submitted 5 February, 2020; v1 submitted 23 June, 2019; originally announced June 2019.

Comments: Published in IEEE Transactions on Audio, Speech and Language Processing

arXiv:1906.08977 [pdf, other]

Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling

Authors: Yuan-Hao Yi, Yang Ai, Zhen-Hua Ling, Li-Rong Dai

Abstract: This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS). Singing voice differs from speech and it contains more local dynamic movements of acoustic features, e.g., vibratos. Therefore, our method adopts deep autoregressive (DAR) models to predict the F0 and spectral features of singing voice in order to better describe the dep… ▽ More This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS). Singing voice differs from speech and it contains more local dynamic movements of acoustic features, e.g., vibratos. Therefore, our method adopts deep autoregressive (DAR) models to predict the F0 and spectral features of singing voice in order to better describe the dependencies among the acoustic features of consecutive frames. For F0 modeling, discretized F0 values are used and the influences of the history length in DAR are analyzed by experiments. An F0 post-processing strategy is also designed to alleviate the inconsistency between the predicted F0 contours and the F0 values determined by music notes. Furthermore, we extend the DAR model to deal with continuous spectral features, and a prenet module with self-attention layers is introduced to process historical frames. Experiments on a Chinese singing voice corpus demonstrate that our method using DARs can produce F0 contours with vibratos effectively, and can achieve better objective and subjective performance than the conventional method using recurrent neural networks (RNNs). △ Less

Submitted 21 June, 2019; originally announced June 2019.

Comments: Interspeech2019

arXiv:1801.07910 [pdf, other]

doi 10.1109/TASLP.2018.2798811

Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension

Authors: Zhen-Hua Ling, Yang Ai, Yu Gu, Li-Rong Dai

Abstract: This paper presents a waveform modeling and generation method using hierarchical recurrent neural networks (HRNN) for speech bandwidth extension (BWE). Different from conventional BWE methods which predict spectral parameters for reconstructing wideband speech waveforms, this BWE method models and predicts waveform samples directly without using vocoders. Inspired by SampleRNN which is an uncondit… ▽ More This paper presents a waveform modeling and generation method using hierarchical recurrent neural networks (HRNN) for speech bandwidth extension (BWE). Different from conventional BWE methods which predict spectral parameters for reconstructing wideband speech waveforms, this BWE method models and predicts waveform samples directly without using vocoders. Inspired by SampleRNN which is an unconditional neural audio generator, the HRNN model represents the distribution of each wideband or high-frequency waveform sample conditioned on the input narrowband waveform samples using a neural network composed of long short-term memory (LSTM) layers and feed-forward (FF) layers. The LSTM layers form a hierarchical structure and each layer operates at a specific temporal resolution to efficiently capture long-span dependencies between temporal sequences. Furthermore, additional conditions, such as the bottleneck (BN) features derived from narrowband speech using a deep neural network (DNN)-based state classifier, are employed as auxiliary input to further improve the quality of generated wideband speech. The experimental results of comparing several waveform modeling methods show that the HRNN-based method can achieve better speech quality and run-time efficiency than the dilated convolutional neural network (DCNN)-based method and the plain sample-level recurrent neural network (SRNN)-based method. Our proposed method also outperforms the conventional vocoder-based BWE method using LSTM-RNNs in terms of the subjective quality of the reconstructed wideband speech. △ Less

Submitted 24 January, 2018; originally announced January 2018.

Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing

Showing 1–34 of 34 results for author: Ai, Y