Showing 1–34 of 34 results for author: Ai, Y

Searching in archive eess.
  1. arXiv:2406.08266  [pdf, other]

    eess.AS cs.SD

    Refining Self-Supervised Learnt Speech Representation using Brain Activations

    Authors: Hengyu Li, Kangdi Mei, Zhaoci Liu, Yang Ai, Liping Chen, Jie Zhang, Zhenhua Ling

    Abstract: It was shown in literature that speech representations extracted by self-supervised pre-trained models exhibit similarities with brain activations of human for speech perception and fine-tuning speech representation models on downstream tasks can further improve the similarity. However, it still remains unclear if this similarity can be used to optimize the pre-trained speech models. In this work,…

    Submitted 13 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  2. arXiv:2406.02250  [pdf, other]

    eess.AS cs.SD

    Multi-Stage Speech Bandwidth Extension with Flexible Sampling Rate Control

    Authors: Ye-Xin Lu, Yang Ai, Zheng-Yan Sheng, Zhen-Hua Ling

    Abstract: The majority of existing speech bandwidth extension (BWE) methods operate under the constraint of fixed source and target sampling rates, which limits their flexibility in practical applications. In this paper, we propose a multi-stage speech BWE model named MS-BWE, which can handle a set of source and target sampling rate pairs and achieve flexible extensions of frequency bandwidth. The proposed…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  3. arXiv:2406.02162  [pdf, other]

    eess.AS cs.SD

    BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation

    Authors: Hui-Peng Du, Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

    Abstract: This paper proposes a novel bidirectional neural vocoder, named BiVocoder, capable both of feature extraction and reverse waveform generation within the short-time Fourier transform (STFT) domain. For feature extraction, the BiVocoder takes amplitude and phase spectra derived from STFT as inputs, transforms them into long-frame-shift and low-dimensional features through convolutional neural networ…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  4. arXiv:2404.08857  [pdf, other]

    cs.SD cs.AI eess.AS

    Voice Attribute Editing with Text Prompt

    Authors: Zhengyan Sheng, Yang Ai, Li-Juan Liu, Jia Pan, Zhen-Hua Ling

    Abstract: Despite recent advancements in speech generation with text prompt providing control over speech style, voice attributes in synthesized speech remain elusive and challenging to control. This paper introduces a novel task: voice attribute editing with text prompt, with the goal of making relative modifications to voice attributes according to the actions described in the text prompt. To solve this t…

    Submitted 12 April, 2024; originally announced April 2024.

  5. arXiv:2403.17378  [pdf, other]

    cs.SD eess.AS

    Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks

    Authors: Yang Ai, Zhen-Hua Ling

    Abstract: This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is a core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional la…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing. arXiv admin note: substantial text overlap with arXiv:2211.15974

  6. arXiv:2402.10533  [pdf, other]

    cs.SD eess.AS

    APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding

    Authors: Yang Ai, Xiao-Hang Jiang, Ye-Xin Lu, Hui-Peng Du, Zhen-Hua Ling

    Abstract: This paper introduces a novel neural audio codec targeting high waveform sampling rates and low bitrates named APCodec, which seamlessly integrates the strengths of parametric codecs and waveform codecs. The APCodec revolutionizes the process of audio encoding and decoding by concurrently handling the amplitude and phase spectra as audio parametric characteristics like parametric codecs. It is com…

    Submitted 16 February, 2024; originally announced February 2024.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  7. arXiv:2401.06387  [pdf, other]

    eess.AS cs.SD eess.SP

    Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction

    Authors: Ye-Xin Lu, Yang Ai, Hui-Peng Du, Zhen-Hua Ling

    Abstract: Speech bandwidth extension (BWE) refers to widening the frequency bandwidth range of speech signals, enhancing the speech quality towards brighter and fuller. This paper proposes a generative adversarial network (GAN) based BWE model with parallel prediction of Amplitude and Phase spectra, named AP-BWE, which achieves both high-quality and efficient wideband speech waveform generation. The propose…

    Submitted 12 January, 2024; originally announced January 2024.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  8. arXiv:2311.11545  [pdf, other]

    eess.AS

    APNet2: High-quality and High-efficiency Neural Vocoder with Direct Prediction of Amplitude and Phase Spectra

    Authors: Hui-Peng Du, Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

    Abstract: In our previous work, we proposed a neural vocoder called APNet, which directly predicts speech amplitude and phase spectra with a 5 ms frame shift in parallel from the input acoustic features, and then reconstructs the 16 kHz speech waveform using inverse short-time Fourier transform (ISTFT). APNet demonstrates the capability to generate synthesized speech of comparable quality to the HiFi-GAN vo…

    Submitted 20 November, 2023; originally announced November 2023.

  9. arXiv:2309.10455  [pdf, other]

    eess.AS cs.SD

    Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement

    Authors: Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

    Abstract: Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes the incorporation of ultrasound tongue images to improve the performance of lip-based AV-SE systems further. To address the challenge of acquiring ultrasound tongue images duri…

    Submitted 20 November, 2023; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing. arXiv admin note: text overlap with arXiv:2305.14933

  10. arXiv:2309.09470  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

    Authors: Zheng-Yan Sheng, Yang Ai, Yan-Nian Chen, Zhen-Hua Ling

    Abstract: This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC), which aims at converting the voice characteristics of an utterance from any source speaker to a newly coming target speaker, solely relying on a single face image of the target speaker. To address this task, we propose a face-voice memory-based zero-shot FaceVC method. This method leverages a memo…

    Submitted 18 September, 2023; originally announced September 2023.

  11. arXiv:2308.08926  [pdf, other]

    eess.AS cs.SD

    Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

    Authors: Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

    Abstract: Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome the above issue, in this paper, we proposed MP-SENet, a novel Speech En…

    Submitted 1 April, 2024; v1 submitted 17 August, 2023; originally announced August 2023.

    Comments: Submitted to IEEE Transactions on Audio, Speech and Language Processing

  12. arXiv:2308.08850  [pdf, other]

    cs.SD eess.AS

    Long-frame-shift Neural Speech Phase Prediction with Spectral Continuity Enhancement and Interpolation Error Compensation

    Authors: Yang Ai, Ye-Xin Lu, Zhen-Hua Ling

    Abstract: Speech phase prediction, which is a significant research focus in the field of signal processing, aims to recover speech phase spectra from amplitude-related features. However, existing speech phase prediction methods are constrained to recovering phase spectra with short frame shifts, which are considerably smaller than the theoretical upper bound required for exact waveform reconstruction of sho…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: Published at IEEE Signal Processing Letters

  13. Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation

    Authors: Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

    Abstract: Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes further incorporating ultrasound tongue images to improve lip-based AV-SE systems' performance. Knowledge distillation is employed at the training stage to address the challeng…

    Submitted 20 November, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Published in InterSpeech 2023

    Journal ref: Proc. INTERSPEECH 2023, 844-848 (2023)

  14. arXiv:2305.14359  [pdf, other]

    cs.MM cs.AI cs.CV cs.SD eess.AS

    Zero-shot personalized lip-to-speech synthesis with face image based voice control

    Authors: Zheng-Yan Sheng, Yang Ai, Zhen-Hua Ling

    Abstract: Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies in a series of independent studies. However, existing studies can not achieve voice control under zero-shot condition, because extra speaker embeddings need to be extracted from natural reference speech and are unavailabl…

    Submitted 8 May, 2023; originally announced May 2023.

    Comments: ICASSP 2023

  15. MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra

    Authors: Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

    Abstract: This paper proposes MP-SENet, a novel Speech Enhancement Network which directly denoises Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec architecture in which the encoder and decoder are bridged by convolution-augmented transformers. The encoder aims to encode time-frequency representations from the input noisy magnitude and phase spectra. The decoder is composed of p…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted by Interspeech 2023

  16. arXiv:2305.07952  [pdf, other]

    cs.SD eess.AS

    APNet: An All-Frame-Level Neural Vocoder Incorporating Direct Prediction of Amplitude and Phase Spectra

    Authors: Yang Ai, Zhen-Hua Ling

    Abstract: This paper presents a novel neural vocoder named APNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra directly. The APNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a residual convolution network which predicts frame-level log amplitude spectra from acoustic features. The PSP al…

    Submitted 13 May, 2023; originally announced May 2023.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing. Codes are available

  17. Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis

    Authors: Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

    Abstract: This paper proposes a source-filter-based generative adversarial neural vocoder named SF-GAN, which achieves high-fidelity waveform generation from input acoustic features by introducing F0-based source excitation signals to a neural filter framework. The SF-GAN vocoder is composed of a source module and a resolution-wise conditional filter module and is trained based on generative adversarial str…

    Submitted 25 April, 2023; originally announced April 2023.

    Comments: Accepted by NCMMSC 2022

    Journal ref: Man-Machine Speech Communication, 2022, pp.68-80

  18. arXiv:2304.11106  [pdf, other]

    cs.NE cs.LG eess.SP

    A Convolutional Spiking Network for Gesture Recognition in Brain-Computer Interfaces

    Authors: Yiming Ai, Bipin Rajendran

    Abstract: Brain-computer interfaces are being explored for a wide variety of therapeutic applications. Typically, this involves measuring and analyzing continuous-time electrical brain activity via techniques such as electrocorticogram (ECoG) or electroencephalography (EEG) to drive external devices. However, due to the inherent noise and variability in the measurements, the analysis of these signals is cha…

    Submitted 27 April, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

    Comments: Accepted at AICAS 2023

  19. arXiv:2304.05574  [pdf, other]

    eess.AS

    Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training

    Authors: Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

    Abstract: This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing sound. This task falls under the umbrella of articulatory-to-acoustic conversion, and may also be referred to as a silent speech interface. We propose to employ a method…

    Submitted 11 April, 2023; originally announced April 2023.

    Comments: To be published in ICASSP 2023

  20. arXiv:2211.15974  [pdf, other]

    cs.SD eess.AS

    Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses

    Authors: Yang Ai, Zhen-Hua Ling

    Abstract: This paper presents a novel speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra by neural networks. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is composed of two parallel linear convolutional layers and a phase calculation formula, imitating the proce…

    Submitted 16 February, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

    Comments: Accepted by ICASSP 2023. Codes are available

  21. arXiv:2111.09413  [pdf, other]

    cs.NI eess.SP

    Mixed Dual-Hop IRS-Assisted FSO-RF Communication System with H-ARQ Protocols

    Authors: Gyan Deep Verma, Aashish Mathur, Yun Ai, Michael Cheffena

    Abstract: Intelligent reflecting surface (IRS) is an emerging key technology for the fifth-generation (5G) and beyond wireless communication systems to provide more robust and reliable communication links. In this paper, we propose a mixed dual-hop free-space optical (FSO)-radio frequency (RF) communication system that serves the end user via a decode-and-forward (DF) relay employing hybrid automatic repeat…

    Submitted 20 August, 2021; originally announced November 2021.

    Comments: 5 pages, 6 figures

  22. arXiv:2103.11657  [pdf, other]

    eess.SP

    Communication Technologies for Smart Grid: A Comprehensive Survey

    Authors: Fredrik Ege Abrahamsen, Yun Ai, Michael Cheffena

    Abstract: With the ongoing trends in the energy sector such as vehicular electrification and renewable energy, smart grid is clearly playing a more and more important role in the electric power system industry. One essential feature of the smart grid is the information flow over the high-speed, reliable and secure data communication network in order to manage the complex power systems effectively and intell…

    Submitted 7 June, 2021; v1 submitted 22 March, 2021; originally announced March 2021.

  23. Secure Outage Analysis of FSO Communications Over Arbitrarily Correlated Málaga Turbulence Channels

    Authors: Yun Ai, Aashish Mathur, Long Kong, Michael Cheffena

    Abstract: In this paper, we analyze the secrecy outage performance for more realistic eavesdropping scenario of free-space optical (FSO) communications, where the main and wiretap links are correlated. The FSO fading channels are modeled by the well-known Málaga distribution. Exact expressions for the secrecy performance metrics such as secrecy outage probability (SOP) and probability of the non zero secrec…

    Submitted 12 March, 2021; originally announced March 2021.

    Comments: 6 pages, 5 figures

  24. arXiv:2011.14899  [pdf, ps, other]

    eess.SP eess.SY

    Secure Vehicular Communications through Reconfigurable Intelligent Surfaces

    Authors: Yun Ai, Felipe A. P. de Figueiredo, Long Kong, Michael Cheffena, Symeon Chatzinotas, Björn Ottersten

    Abstract: Reconfigurable intelligent surfaces (RIS) is considered as a revolutionary technique to improve the wireless system performance by reconfiguring the radio wave propagation environment artificially. Motivated by the potential of RIS in vehicular networks, we analyze the secrecy outage performance of RIS-aided vehicular communications in this paper. More specifically, two vehicular communication sce…

    Submitted 2 December, 2020; v1 submitted 30 November, 2020; originally announced November 2020.

    Comments: 6 pages

  25. arXiv:2011.05038  [pdf, other]

    eess.AS cs.SD

    Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model

    Authors: Haoyu Li, Yang Ai, Junichi Yamagishi

    Abstract: High-quality speech corpora are essential foundations for most speech applications. However, such speech data are expensive and limited since they are collected in professional recording environments. In this work, we propose an encoder-decoder neural network to automatically enhance low-quality recordings to professional high-quality recordings. To address channel variability, we first filter out…

    Submitted 10 November, 2020; originally announced November 2020.

    Comments: 8 pages. Accepted to IEEE SLT 2021

  26. arXiv:2011.03955  [pdf, other]

    cs.SD eess.AS

    Denoising-and-Dereverberation Hierarchical Neural Vocoder for Robust Waveform Generation

    Authors: Yang Ai, Haoyu Li, Xin Wang, Junichi Yamagishi, Zhenhua Ling

    Abstract: This paper presents a denoising and dereverberation hierarchical neural vocoder (DNR-HiNet) to convert noisy and reverberant acoustic features into a clean speech waveform. We implement it mainly by modifying the amplitude spectrum predictor (ASP) in the original HiNet vocoder. This modified denoising and dereverberation ASP (DNR-ASP) can predict clean log amplitude spectra (LAS) from input degrad…

    Submitted 8 November, 2020; originally announced November 2020.

    Comments: Accepted by SLT 2021

  27. arXiv:2008.06182  [pdf, other]

    eess.AS

    Online Speaker Adaptation for WaveNet-based Neural Vocoders

    Authors: Qiuchen Huang, Yang Ai, Zhenhua Ling

    Abstract: In this paper, we propose an online speaker adaptation method for WaveNet-based neural vocoders in order to improve their performance on speaker-independent waveform generation. In this method, a speaker encoder is first constructed using a large speaker-verification dataset which can extract a speaker embedding vector from an utterance pronounced by an arbitrary speaker. At the training stage, a…

    Submitted 13 August, 2020; originally announced August 2020.

    Comments: 6 pages, 2 figures, 4 tables

  28. arXiv:2007.10786  [pdf]

    cs.LG cs.AI eess.SP

    Comparison of Different Methods for Time Sequence Prediction in Autonomous Vehicles

    Authors: Teng Liu, Bin Tian, Yunfeng Ai, Long Chen, Fei Liu, Dongpu Cao

    Abstract: As a combination of various kinds of technologies, autonomous vehicles could complete a series of driving tasks by itself, such as perception, decision-making, planning, and control. Since there is no human driver to handle the emergency situation, future transportation information is significant for automated vehicles. This paper proposes different methods to forecast the time series for autonomo…

    Submitted 16 July, 2020; originally announced July 2020.

    Comments: 6 pages, 11 figures

  29. arXiv:2005.07379  [pdf, other]

    cs.SD eess.AS

    Reverberation Modeling for Source-Filter-based Neural Vocoder

    Authors: Yang Ai, Xin Wang, Junichi Yamagishi, Zhen-Hua Ling

    Abstract: This paper presents a reverberation module for source-filter-based neural vocoders that improves the performance of reverberant effect modeling. This module uses the output waveform of neural vocoders as an input and produces a reverberant waveform by convolving the input with a room impulse response (RIR). We propose two approaches to parameterizing and estimating the RIR. The first approach assu…

    Submitted 15 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  30. arXiv:2004.13761  [pdf]

    eess.SP

    A Method for Vehicle Collision Risk Assessment through Inferring Driver's Braking Actions in Near-Crash Situations

    Authors: Liqun Peng, Miguel Angel Sotelo, Yi He, Yunfei Ai, Zhixiong Li

    Abstract: Driving information and data under potential vehicle crashes create opportunities for extensive real-world observations of driver behaviors and relevant factors that significantly influence the driving safety in emergency scenarios. Furthermore, the availability of such data also enhances the collision avoidance systems (CASs) by evaluating driver's actions in near-crash scenarios and providing ti…

    Submitted 28 April, 2020; originally announced April 2020.

    Comments: 14 pages

  31. arXiv:2004.07832  [pdf, other]

    eess.AS cs.SD

    Knowledge-and-Data-Driven Amplitude Spectrum Prediction for Hierarchical Neural Vocoders

    Authors: Yang Ai, Zhen-Hua Ling

    Abstract: In our previous work, we have proposed a neural vocoder called HiNet which recovers speech waveforms by predicting amplitude and phase spectra hierarchically from input acoustic features. In HiNet, the amplitude spectrum predictor (ASP) predicts log amplitude spectra (LAS) from input acoustic features. This paper proposes a novel knowledge-and-data-driven ASP (KDD-ASP) to improve the conventional…

    Submitted 18 May, 2020; v1 submitted 16 April, 2020; originally announced April 2020.

    Comments: Submitted to Interspeech 2020

  32. A Neural Vocoder with Hierarchical Generation of Amplitude and Phase Spectra for Statistical Parametric Speech Synthesis

    Authors: Yang Ai, Zhen-Hua Ling

    Abstract: This paper presents a neural vocoder named HiNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra hierarchically. Different from existing neural vocoders such as WaveNet, SampleRNN and WaveRNN which directly generate waveform samples using single neural networks, the HiNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase s…

    Submitted 5 February, 2020; v1 submitted 23 June, 2019; originally announced June 2019.

    Comments: Published in IEEE Transactions on Audio, Speech and Language Processing

  33. arXiv:1906.08977  [pdf, other]

    cs.SD cs.LG eess.AS

    Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling

    Authors: Yuan-Hao Yi, Yang Ai, Zhen-Hua Ling, Li-Rong Dai

    Abstract: This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS). Singing voice differs from speech and it contains more local dynamic movements of acoustic features, e.g., vibratos. Therefore, our method adopts deep autoregressive (DAR) models to predict the F0 and spectral features of singing voice in order to better describe the dep…

    Submitted 21 June, 2019; originally announced June 2019.

    Comments: Interspeech 2019

  34. Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension

    Authors: Zhen-Hua Ling, Yang Ai, Yu Gu, Li-Rong Dai

    Abstract: This paper presents a waveform modeling and generation method using hierarchical recurrent neural networks (HRNN) for speech bandwidth extension (BWE). Different from conventional BWE methods which predict spectral parameters for reconstructing wideband speech waveforms, this BWE method models and predicts waveform samples directly without using vocoders. Inspired by SampleRNN which is an uncondit…

    Submitted 24 January, 2018; originally announced January 2018.

    Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing