
arXiv:2306.02317

SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings

Authors: Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg

Abstract: Contextual spelling correction models are an alternative to shallow fusion to improve automatic speech recognition (ASR) quality given user vocabulary. To deal with large user vocabularies, most of these models include candidate retrieval mechanisms, usually based on minimum edit distance between fragments of ASR hypothesis and user phrases. However, the edit-distance approach is slow, non-trainable, and may have low recall as it relies only on common letters. We propose: 1) a novel algorithm for candidate retrieval, based on misspelled n-gram mappings, which gives up to 90% recall with just the top 10 candidates on Spoken Wikipedia; 2) a non-autoregressive neural model based on BERT architecture, where the initial transcript and ten candidates are combined into one input. The experiments on Spoken Wikipedia show 21.4% word error rate improvement compared to a baseline ASR system.

Submitted 4 June, 2023; originally announced June 2023.

Comments: Accepted by INTERSPEECH 2023
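
The edit-distance candidate retrieval that the abstract above describes as the usual baseline can be sketched roughly as follows. This is only an illustration of that baseline, not the paper's n-gram mapping algorithm or its code; the function names, the word-window fragmenting, and the sample phrases are all assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        prev = curr
    return prev[-1]


def retrieve_candidates(hypothesis: str, user_phrases: list[str], top_k: int = 10) -> list[str]:
    """Rank user phrases by their minimum edit distance to any fragment
    (word window of the same word count) of the ASR hypothesis.

    Hypothetical helper: the fragmenting scheme is an assumption.
    """
    words = hypothesis.split()
    scored = []
    for phrase in user_phrases:
        n = len(phrase.split())
        # Slide a window of n words over the hypothesis.
        fragments = [
            " ".join(words[i:i + n])
            for i in range(max(1, len(words) - n + 1))
        ]
        best = min(edit_distance(frag, phrase) for frag in fragments)
        scored.append((best, phrase))
    scored.sort()
    return [phrase for _, phrase in scored[:top_k]]
```

Every user phrase is compared against every fragment, so the cost grows with both vocabulary size and hypothesis length, which is consistent with the abstract's remark that this approach is slow.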

arXiv:2104.05055

NeMo Inverse Text Normalization: From Development To Production

Authors: Yang Zhang, Evelina Bakhturina, Kyle Gorman, Boris Ginsburg

Abstract: Inverse text normalization (ITN) converts spoken-domain automatic speech recognition (ASR) output into written-domain text to improve the readability of the ASR output. Many state-of-the-art ITN systems use hand-written weighted finite-state transducer (WFST) grammars since this task has extremely low tolerance to unrecoverable errors. We introduce an open-source Python WFST-based library for ITN which enables a seamless path from development to production. We describe the specification of ITN grammar rules for English, but the library can be adapted for other languages. It can also be used for written-to-spoken text normalization. We evaluate the NeMo ITN library using a modified version of the Google Text normalization dataset.

Submitted 17 May, 2021; v1 submitted 11 April, 2021; originally announced April 2021.

arXiv:2104.04896

A Toolbox for Construction and Analysis of Speech Datasets

Authors: Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg

Abstract: Automatic Speech Recognition and Text-to-Speech systems are primarily trained in a supervised fashion and require high-quality, accurately labeled speech datasets. In this work, we examine common problems with speech data and introduce a toolbox for the construction and interactive error analysis of speech datasets. The construction tool is based on the work of Kürzinger et al., and, to the best of our knowledge, the dataset exploration tool is the world's first open-source tool of this kind. We demonstrate how to apply these tools to create a Russian speech dataset and analyze existing speech datasets (Multilingual LibriSpeech, Mozilla Common Voice). The tools are open-sourced as a part of the NeMo framework.

Submitted 6 January, 2022; v1 submitted 10 April, 2021; originally announced April 2021.

arXiv:2104.01497

Hi-Fi Multi-Speaker English TTS Dataset

Authors: Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang

Abstract: This paper introduces a new multi-speaker English dataset for training text-to-speech models. The dataset is based on LibriVox audiobooks and Project Gutenberg texts, both in the public domain. The new dataset contains about 292 hours of speech from 10 speakers with at least 17 hours per speaker sampled at 44.1 kHz. To select speech samples with high quality, we considered audio recordings with a signal bandwidth of at least 13 kHz and a signal-to-noise ratio (SNR) of at least 32 dB. The dataset is publicly released at http://www.openslr.org/109/.

Submitted 14 June, 2021; v1 submitted 3 April, 2021; originally announced April 2021.
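
The selection criterion stated in the abstract above (signal bandwidth of at least 13 kHz and SNR of at least 32 dB) amounts to a simple threshold filter. The sketch below assumes that bandwidth and SNR have already been estimated for each recording; the abstract does not say how they were measured, and the `Recording` type, field names, and file paths are hypothetical.

```python
from dataclasses import dataclass

# Thresholds taken from the abstract.
MIN_BANDWIDTH_HZ = 13_000.0
MIN_SNR_DB = 32.0


@dataclass
class Recording:
    # Hypothetical record; the measurements are assumed to be precomputed.
    path: str
    bandwidth_hz: float
    snr_db: float


def select_high_quality(recordings: list[Recording]) -> list[Recording]:
    """Keep recordings that meet both the bandwidth and the SNR threshold."""
    return [
        r for r in recordings
        if r.bandwidth_hz >= MIN_BANDWIDTH_HZ and r.snr_db >= MIN_SNR_DB
    ]


if __name__ == "__main__":
    sample = [
        Recording("a.flac", 16_000.0, 35.0),  # passes both thresholds
        Recording("b.flac", 11_000.0, 40.0),  # bandwidth too narrow
        Recording("c.flac", 15_000.0, 28.0),  # too noisy
    ]
    print([r.path for r in select_high_quality(sample)])
```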

Showing 1–4 of 4 results for author: Bakhturina, E