Showing 1–11 of 11 results for author: Yeung, Y T

Searching in archive eess.
  1. arXiv:2406.08989  [pdf, other]

    eess.AS cs.SD

    ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis

    Authors: Dehua Tao, Daxin Tan, Yu Ting Yeung, Xiao Chen, Tan Lee

    Abstract: Representing speech as discretized units has numerous benefits in supporting downstream spoken language processing tasks. However, the approach has been less explored in speech synthesis of tonal languages like Mandarin Chinese. Our preliminary experiments on Chinese speech synthesis reveal the issue of "tone shift", where a synthesized speech utterance contains correct base syllables but incorrec…

    Submitted 13 June, 2024; originally announced June 2024.

  2. arXiv:2310.05374  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis

    Authors: Jianqiao Lu, Wenyong Huang, Nianzu Zheng, Xingshan Zeng, Yu Ting Yeung, Xiao Chen

    Abstract: Training a high performance end-to-end speech (E2E) processing model requires an enormous amount of labeled speech data, especially in the era of data-centric artificial intelligence. However, labeled speech data are usually scarcer and more expensive for collection, compared to textual data. We propose Latent Synthesis (LaSyn), an efficient textual data utilization framework for E2E speech proces…

    Submitted 24 October, 2023; v1 submitted 8 October, 2023; originally announced October 2023.

    Comments: 15 pages, 8 figures, 8 tables, Accepted to EMNLP 2023 Findings

  3. arXiv:2204.05460  [pdf, other]

    eess.AS cs.CL cs.SD

    CorrectSpeech: A Fully Automated System for Speech Correction and Accent Reduction

    Authors: Daxin Tan, Liqun Deng, Nianzu Zheng, Yu Ting Yeung, Xin Jiang, Xiao Chen, Tan Lee

    Abstract: This study proposes a fully automated system for speech correction and accent reduction. Consider the application scenario that a recorded speech audio contains certain errors, e.g., inappropriate words, mispronunciations, that need to be corrected. The proposed system, named CorrectSpeech, performs the correction in three steps: recognizing the recorded speech and converting it into time-stamped s…

    Submitted 13 October, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Accepted by ISCSLP 2022

  4. arXiv:2201.12155  [pdf, other]

    cs.CL cs.SD eess.AS

    Reducing language context confusion for end-to-end code-switching automatic speech recognition

    Authors: Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Jianhua Tao, Yu Ting Yeung, Liqun Deng

    Abstract: Code-switching deals with alternating languages in the communication process. Training end-to-end (E2E) automatic speech recognition (ASR) systems for code-switching is especially challenging as code-switching training data are always insufficient to combat the increased multilingual context confusion due to the presence of more than one language. We propose a language-related attention mechanism to r…

    Submitted 29 June, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: arXiv admin note: text overlap with arXiv:2010.14798, the paper has been accepted by Interspeech 2022

  5. arXiv:2201.10207  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training

    Authors: Wenyong Huang, Zhenhe Zhang, Yu Ting Yeung, Xin Jiang, Qun Liu

    Abstract: We introduce a new approach for speech pre-training named SPIRAL which works by learning denoising representation of perturbed data in a teacher-student framework. Specifically, given a speech utterance, we first feed the utterance to a teacher network to obtain corresponding representation. Then the same utterance is perturbed and fed to a student network. The student network is trained to output…

    Submitted 6 March, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

    Comments: ICLR 2022

  6. arXiv:2111.08191  [pdf, other]

    cs.CL cs.SD eess.AS

    CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis

    Authors: Nianzu Zheng, Liqun Deng, Wenyong Huang, Yu Ting Yeung, Baohua Xu, Yuanyuan Guo, Yasheng Wang, Xiao Chen, Xin Jiang, Qun Liu

    Abstract: Mispronunciation detection and diagnosis (MDD) is a popular research focus in computer-aided pronunciation training (CAPT) systems. End-to-end (e2e) approaches are becoming dominant in MDD. However, an e2e MDD model usually requires entire speech utterances as input context, which leads to significant time latency, especially for long paragraphs. We propose a streaming e2e MDD model called CoCA-MDD.…

    Submitted 29 June, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

    Comments: 5 pages, 4 figures, Accepted by INTERSPEECH 2022

  7. arXiv:2107.01554  [pdf, other]

    eess.AS cs.SD

    EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion

    Authors: Daxin Tan, Liqun Deng, Yu Ting Yeung, Xin Jiang, Xiao Chen, Tan Lee

    Abstract: This paper presents the design, implementation and evaluation of a speech editing system, named EditSpeech, which allows a user to perform deletion, insertion and replacement of words in a given speech utterance, without causing audible degradation in speech quality and naturalness. The EditSpeech system is developed upon a neural text-to-speech (NTTS) synthesis framework. Partial inference and bi…

    Submitted 7 October, 2021; v1 submitted 4 July, 2021; originally announced July 2021.

    Comments: Accepted by ASRU 2021

  8. arXiv:2106.10132  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD eess.SP

    VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion

    Authors: Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng

    Abstract: One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement. Existing work generally ignores the correlation between different speech representations during training, which causes leakage of content information into the speaker representation and t…

    Submitted 18 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021. Code, pre-trained models and demo are available at https://github.com/Wendison/VQMIVC

  9. arXiv:2106.10127  [pdf, other]

    eess.AS cs.CL cs.SD eess.SP

    Unsupervised Domain Adaptation for Dysarthric Speech Detection via Domain Adversarial Training and Mutual Information Minimization

    Authors: Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng

    Abstract: Dysarthric speech detection (DSD) systems aim to detect characteristics of the neuromotor disorder from speech. Such systems are particularly susceptible to domain mismatch where the training and testing data come from the source and target domains respectively, but the two domains may differ in terms of speech stimuli, disease etiology, etc. It is hard to acquire labelled data in the target domai…

    Submitted 18 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021

  10. arXiv:2010.11657  [pdf, other]

    cs.SD cs.CL eess.AS

    The HUAWEI Speaker Diarisation System for the VoxCeleb Speaker Diarisation Challenge

    Authors: Renyu Wang, Ruilin Tong, Yu Ting Yeung, Xiao Chen

    Abstract: This paper describes the system setup of our submission to the speaker diarisation track (Track 4) of the VoxCeleb Speaker Recognition Challenge 2020. Our diarisation system consists of a well-trained neural network based speech enhancement model as pre-processing front-end of input speech signals. We replace conventional energy-based voice activity detection (VAD) with a neural network based VAD. The neural…

    Submitted 23 October, 2020; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: 5 pages, 2 figures, A report about our diarisation system for VoxCeleb Challenge, Interspeech conference workshop

  11. arXiv:2008.05750  [pdf, other]

    eess.AS cs.CL cs.SD

    Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition

    Authors: Wenyong Huang, Wenchao Hu, Yu Ting Yeung, Xiao Chen

    Abstract: Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR), and requires significantly less training time than RNN-based models. The original Transformer, with encoder-decoder architecture, is only suitable for offline ASR. It relies on an attention mechanism to learn alignments, and encodes input audio bidirectionally. The hig…

    Submitted 13 August, 2020; originally announced August 2020.

    Comments: Accepted by INTERSPEECH 2020