
Showing 1–20 of 20 results for author: Scharenborg, O

Searching in archive eess.
  1. arXiv:2406.10284

    cs.CL cs.SD eess.AS

    Improving child speech recognition with augmented child-like speech

    Authors: Yuanyuan Zhang, Zhengjun Yue, Tanvina Patel, Odette Scharenborg

    Abstract: State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and additional (new) child speakers via monolingual and cross-lingual (Dutch-to-German) VC, respectively. The results showed that cross-lingua…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: 5 pages, 1 figure. Accepted to INTERSPEECH 2024

  2. arXiv:2312.15499

    eess.AS

    Exploring data augmentation in bias mitigation against non-native-accented speech

    Authors: Yuanyuan Zhang, Aaricia Herygers, Tanvina Patel, Zhengjun Yue, Odette Scharenborg

    Abstract: Automatic speech recognition (ASR) should serve every speaker, not only the majority "standard" speakers of a language. In order to build inclusive ASR, mitigating the bias against speaker groups who speak in a "non-standard" or "diverse" way is crucial. We aim to mitigate the bias against non-native-accented Flemish in a Flemish ASR system. Since this is a low-resource problem, we investiga…

    Submitted 24 December, 2023; originally announced December 2023.

    Comments: Accepted to ASRU 2023

  3. Improving Whispered Speech Recognition Performance using Pseudo-whispered based Data Augmentation

    Authors: Zhaofeng Lin, Tanvina Patel, Odette Scharenborg

    Abstract: Whispering is a distinct form of speech known for its soft, breathy, and hushed characteristics, often used for private communication. The acoustic characteristics of whispered speech differ substantially from normally phonated speech and the scarcity of adequate training data leads to low automatic speech recognition (ASR) performance. To address the data scarcity issue, we use a signal processin…

    Submitted 9 November, 2023; originally announced November 2023.

    Comments: Accepted to ASRU 2023

  4. arXiv:2309.08348

    eess.AS cs.SD

    The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

    Authors: Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang, Hongbo Lan, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao

    Abstract: Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures

  5. arXiv:2206.12489

    eess.AS cs.SD

    Predicting within and across language phoneme recognition performance of self-supervised learning speech pre-trained models

    Authors: Hang Ji, Tanvina Patel, Odette Scharenborg

    Abstract: In this work, we analyzed and compared speech representations extracted from different frozen self-supervised learning (SSL) speech pre-trained models on their ability to capture articulatory features (AF) information and their subsequent prediction of phone recognition performance for within and across language scenarios. Specifically, we compared CPC, wav2vec 2.0, and HuBert. First, frame-level…

    Submitted 24 June, 2022; originally announced June 2022.

    Comments: Submitted to INTERSPEECH 2022

  6. arXiv:2203.17072

    cs.SD cs.CL eess.AS

    Manipulation of oral cancer speech using neural articulatory synthesis

    Authors: Bence Mark Halpern, Teja Rebernik, Thomas Tienkamp, Rob van Son, Michiel van den Brekel, Martijn Wieling, Max Witjes, Odette Scharenborg

    Abstract: We present an articulatory synthesis framework for the synthesis and manipulation of oral cancer speech for clinical decision making and alleviation of patient stress. Objective and subjective evaluations demonstrate that the framework has acceptable naturalness and is worth further investigation. A subsequent subjective vowel and consonant identification experiment showed that the articulatory sy…

    Submitted 31 March, 2022; originally announced March 2022.

    Comments: 5 pages, 4 tables, 1 figure. Submitted to Interspeech 2022

  7. arXiv:2201.11207

    cs.SD cs.CL eess.AS

    Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition

    Authors: Piotr Żelasko, Siyuan Feng, Laureano Moro Velazquez, Ali Abavisani, Saurabhchand Bhati, Odette Scharenborg, Mark Hasegawa-Johnson, Najim Dehak

    Abstract: The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order to build ASR systems for these low-resource languages. Wh…

    Submitted 27 January, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

    Comments: Accepted for publication in Computer Speech and Language

  8. arXiv:2201.04908

    cs.SD cs.AI eess.AS

    The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition

    Authors: Luke Prananta, Bence Mark Halpern, Siyuan Feng, Odette Scharenborg

    Abstract: In this paper, we investigate several existing and a new state-of-the-art generative adversarial network-based (GAN) voice conversion method for enhancing dysarthric speech for improved dysarthric speech recognition. We compare key components of existing methods as part of a rigorous ablation study to find the most effective solution to improve dysarthric speech recognition. We find that straightf…

    Submitted 13 January, 2022; originally announced January 2022.

    Comments: Extended version of paper to be submitted to Interspeech 2022. 6 pages, 2 tables

  9. arXiv:2110.08213

    cs.SD cs.CL eess.AS q-bio.QM

    Towards Identity Preserving Normal to Dysarthric Voice Conversion

    Authors: Wen-Chin Huang, Bence Mark Halpern, Lester Phillip Violeta, Odette Scharenborg, Tomoki Toda

    Abstract: We present a voice conversion framework that converts normal speech into dysarthric speech while preserving the speaker identity. Such a framework is essential for (1) clinical decision making processes and alleviation of patient stress, (2) data augmentation for dysarthric speech recognition. This is an especially challenging task since the converted samples should capture the severity of dysarth…

    Submitted 15 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  10. arXiv:2107.00308

    cs.SD cs.CL eess.AS

    An Objective Evaluation Framework for Pathological Speech Synthesis

    Authors: Bence Mark Halpern, Julian Fritsch, Enno Hermann, Rob van Son, Odette Scharenborg, Mathew Magimai.-Doss

    Abstract: The development of pathological speech systems is currently hindered by the lack of a standardised objective evaluation framework. In this work, (1) we utilise existing detection and analysis techniques to propose a general framework for the consistent evaluation of synthetic pathological speech. This framework evaluates the voice quality and the intelligibility aspects of speech and is shown to b…

    Submitted 1 July, 2021; originally announced July 2021.

    Comments: 4 pages, 4 figures. Accepted to the ITG Conference on Speech Communication | 29.09.2021 - 01.10.2021 | Kiel

  11. arXiv:2106.08427

    cs.SD cs.CL eess.AS

    Pathological voice adaptation with autoencoder-based voice conversion

    Authors: Marc Illa, Bence Mark Halpern, Rob van Son, Laureano Moro-Velazquez, Odette Scharenborg

    Abstract: In this paper, we propose a new approach to pathological speech synthesis. Instead of using healthy speech as a source, we customise an existing pathological speech sample to a new speaker's voice characteristics. This approach alleviates the evaluation problem one normally has when converting typical speech to pathological speech, as in our approach, the voice conversion (VC) model does not need…

    Submitted 15 June, 2021; originally announced June 2021.

    Comments: 6 pages, 3 figures. Accepted to the 11th ISCA Speech Synthesis Workshop (2021)

  12. arXiv:2104.00994

    eess.AS cs.CL cs.SD

    Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation

    Authors: Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Odette Scharenborg

    Abstract: This paper tackles automatically discovering phone-like acoustic units (AUD) from unlabeled speech data. Past studies usually proposed single-step approaches. We propose a two-stage approach: the first stage learns a subword-discriminative feature representation and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units. I…

    Submitted 7 June, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

    Comments: Accepted for publication in INTERSPEECH 2021

  13. arXiv:2103.15122

    eess.AS cs.CL cs.SD

    Quantifying Bias in Automatic Speech Recognition

    Authors: Siyuan Feng, Olya Kudina, Bence Mark Halpern, Odette Scharenborg

    Abstract: Automatic speech recognition (ASR) systems promise to deliver objective interpretation of human speech. Practice and recent evidence suggests that the state-of-the-art (SotA) ASRs struggle with the large variation in speech due to e.g., gender, age, speech impairment, race, and accents. Many factors can cause the bias of an ASR system. Our overarching goal is to uncover bias in ASR systems to work…

    Submitted 1 April, 2021; v1 submitted 28 March, 2021; originally announced March 2021.

    Comments: Submitted to INTERSPEECH (IS) 2021. This preprint version differs slightly from the version submitted to IS 2021: Figure 1 is not included in IS 2021

  14. arXiv:2012.09544

    eess.AS cs.CL cs.SD

    The effectiveness of unsupervised subword modeling with autoregressive and cross-lingual phone-aware networks

    Authors: Siyuan Feng, Odette Scharenborg

    Abstract: This study addresses unsupervised subword modeling, i.e., learning acoustic feature representations that can distinguish between subword units of a language. We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer. The framework consists of autoregressive predictive coding (APC) as the front-end and a cross-lingual deep neural network (…

    Submitted 28 April, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

    Comments: 18 pages (including 1 page as supplementary material), 13 figures. Accepted for publication in IEEE Open Journal of Signal Processing (OJ-SP)

  15. arXiv:2011.06239

    eess.AS cs.SD

    The CUHK-TUDELFT System for The SLT 2021 Children Speech Recognition Challenge

    Authors: Si-Ioi Ng, Wei Liu, Zhiyuan Peng, Siyuan Feng, Hing-Pang Huang, Odette Scharenborg, Tan Lee

    Abstract: This technical report describes our submission to the 2021 SLT Children Speech Recognition Challenge (CSRC) Track 1. Our approach combines the use of a joint CTC-attention end-to-end (E2E) speech recognition framework, transfer learning, data augmentation and development of various language models. Procedures of data pre-processing, the background and the course of system development are described…

    Submitted 12 November, 2020; originally announced November 2020.

    Comments: Submitted to 2021 SLT Children Speech Recognition Challenge (CSRC)

  16. How Phonotactics Affect Multilingual and Zero-shot ASR Performance

    Authors: Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

    Abstract: The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successfu…

    Submitted 10 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted for publication in IEEE ICASSP 2021. The first 2 authors contributed equally to this work

  17. arXiv:2007.14205

    eess.AS cs.LG cs.SD

    Detecting and analysing spontaneous oral cancer speech in the wild

    Authors: Bence Mark Halpern, Rob van Son, Michiel van den Brekel, Odette Scharenborg

    Abstract: Oral cancer speech is a disease which impacts more than half a million people worldwide every year. Analysis of oral cancer speech has so far focused on read speech. In this paper, we 1) present and 2) analyse a three-hour long spontaneous oral cancer speech dataset collected from YouTube. 3) We set baselines for an oral cancer speech detection task on this dataset. The analysis of these explainab…

    Submitted 28 July, 2020; originally announced July 2020.

    Comments: Accepted to Interspeech 2020

  18. Unsupervised Subword Modeling Using Autoregressive Pretraining and Cross-Lingual Phone-Aware Modeling

    Authors: Siyuan Feng, Odette Scharenborg

    Abstract: This study addresses unsupervised subword modeling, i.e., learning feature representations that can distinguish subword units of a language. The proposed approach adopts a two-stage bottleneck feature (BNF) learning framework, consisting of autoregressive predictive coding (APC) as a front-end and a DNN-BNF model as a back-end. APC pretrained features are set as input features to a DNN-BNF model.…

    Submitted 6 August, 2020; v1 submitted 25 July, 2020; originally announced July 2020.

    Comments: 5 pages, 3 figures. Accepted for publication in INTERSPEECH 2020, Shanghai, China

  19. arXiv:2005.08118

    eess.AS cs.CL cs.SD

    That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages

    Authors: Piotr Żelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

    Abstract: Only a handful of the world's languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020. For some reason, the ArXiv Latex engine rendered it in more than 4 pages

  20. arXiv:1803.05058

    cs.SD cs.CL eess.AS

    Investigating the Effect of Music and Lyrics on Spoken-Word Recognition

    Authors: Odette Scharenborg, Martha Larson

    Abstract: Background music in social interaction settings can hinder conversation. Yet, little is known of how specific properties of music impact speech processing. This paper addresses this knowledge gap by investigating 1) whether the masking effect of background music with lyrics is larger than that of music without lyrics, and 2) whether the masking effect is larger for more complex music. To answer th…

    Submitted 13 March, 2018; originally announced March 2018.

    Comments: Preliminary study