Search | arXiv e-print repository

arXiv:2007.06949 [pdf, other]

Deep Transformer based Data Augmentation with Subword Units for Morphologically Rich Online ASR

Authors: Balázs Tarján, György Szaszák, Tibor Fegyó, Péter Mihajlik

Abstract: Recently Deep Transformer models have proven to be particularly powerful in language modeling tasks for ASR. Their high complexity, however, makes them very difficult to apply in the first (single) pass of an online system. Recent studies showed that a considerable part of the knowledge of neural network Language Models (LM) can be transferred to traditional n-grams by using neural text generation… ▽ More Recently Deep Transformer models have proven to be particularly powerful in language modeling tasks for ASR. Their high complexity, however, makes them very difficult to apply in the first (single) pass of an online system. Recent studies showed that a considerable part of the knowledge of neural network Language Models (LM) can be transferred to traditional n-grams by using neural text generation based data augmentation. In our paper, we pre-train a GPT-2 Transformer LM on a general text corpus and fine-tune it on our Hungarian conversational call center ASR task. We show that although data augmentation with Transformer-generated text works well for isolating languages, it causes a vocabulary explosion in a morphologically rich language. Therefore, we propose a new method called subword-based neural text augmentation, where we retokenize the generated text into statistically derived subwords. We compare Morfessor and BPE statistical subword tokenizers and show that both methods can significantly improve the WER while greatly reducing vocabulary size and memory requirements. Finally, we also demonstrate that subword-based neural text augmentation outperforms the word-based approach not only in terms of overall WER but also in recognition of OOV words. △ Less

Submitted 4 November, 2020; v1 submitted 14 July, 2020; originally announced July 2020.

Comments: 7 pages, 4 figures

arXiv:2006.05129 [pdf, other]

doi 10.1007/978-3-030-58323-1_47

On the Effectiveness of Neural Text Generation based Data Augmentation for Recognition of Morphologically Rich Speech

Authors: Balázs Tarján, György Szaszák, Tibor Fegyó, Péter Mihajlik

Abstract: Advanced neural network models have penetrated Automatic Speech Recognition (ASR) in recent years, however, in language modeling many systems still rely on traditional Back-off N-gram Language Models (BNLM) partly or entirely. The reason for this are the high cost and complexity of training and using neural language models, mostly possible by adding a second decoding pass (rescoring). In our recen… ▽ More Advanced neural network models have penetrated Automatic Speech Recognition (ASR) in recent years, however, in language modeling many systems still rely on traditional Back-off N-gram Language Models (BNLM) partly or entirely. The reason for this are the high cost and complexity of training and using neural language models, mostly possible by adding a second decoding pass (rescoring). In our recent work we have significantly improved the online performance of a conversational speech transcription system by transferring knowledge from a Recurrent Neural Network Language Model (RNNLM) to the single pass BNLM with text generation based data augmentation. In the present paper we analyze the amount of transferable knowledge and demonstrate that the neural augmented LM (RNN-BNLM) can help to capture almost 50% of the knowledge of the RNNLM yet by drop** the second decoding pass and making the system real-time capable. We also systematically compare word and subword LMs and show that subword-based neural text augmentation can be especially beneficial in under-resourced conditions. In addition, we show that using the RNN-BNLM in the first pass followed by a neural second pass, offline ASR results can be even significantly improved. △ Less

Submitted 9 June, 2020; originally announced June 2020.

Comments: 8 pages, 2 figures, accepted for publication at TSD 2020

arXiv:1911.06615 [pdf]

doi 10.3311/PPee.17024

Deep learning methods in speaker recognition: a review

Authors: Dávid Sztahó, György Szaszák, András Beke

Abstract: This paper summarizes the applied deep learning practices in the field of speaker recognition, both verification and identification. Speaker recognition has been a widely used field topic of speech technology. Many research works have been carried out and little progress has been achieved in the past 5-6 years. However, as deep learning techniques do advance in most machine learning fields, the fo… ▽ More This paper summarizes the applied deep learning practices in the field of speaker recognition, both verification and identification. Speaker recognition has been a widely used field topic of speech technology. Many research works have been carried out and little progress has been achieved in the past 5-6 years. However, as deep learning techniques do advance in most machine learning fields, the former state-of-the-art methods are getting replaced by them in speaker recognition too. It seems that DL becomes the now state-of-the-art solution for both speaker verification and identification. The standard x-vectors, additional to i-vectors, are used as baseline in most of the novel works. The increasing amount of gathered data opens up the territory to DL, where they are the most effective. △ Less

Submitted 14 November, 2019; originally announced November 2019.

arXiv:1907.06407 [pdf, other]

doi 10.1007/978-3-030-31372-2_19

Investigation on N-gram Approximated RNNLMs for Recognition of Morphologically Rich Speech

Authors: Balázs Tarján, György Szaszák, Tibor Fegyó, Péter Mihajlik

Abstract: Recognition of Hungarian conversational telephone speech is challenging due to the informal style and morphological richness of the language. Recurrent Neural Network Language Model (RNNLM) can provide remedy for the high perplexity of the task; however, two-pass decoding introduces a considerable processing delay. In order to eliminate this delay we investigate approaches aiming at the complexity… ▽ More Recognition of Hungarian conversational telephone speech is challenging due to the informal style and morphological richness of the language. Recurrent Neural Network Language Model (RNNLM) can provide remedy for the high perplexity of the task; however, two-pass decoding introduces a considerable processing delay. In order to eliminate this delay we investigate approaches aiming at the complexity reduction of RNNLM, while preserving its accuracy. We compare the performance of conventional back-off n-gram language models (BNLM), BNLM approximation of RNNLMs (RNN-BNLM) and RNN n-grams in terms of perplexity and word error rate (WER). Morphological richness is often addressed by using statistically derived subwords - morphs - in the language models, hence our investigations are extended to morph-based models, as well. We found that using RNN-BNLMs 40% of the RNNLM perplexity reduction can be recovered, which is roughly equal to the performance of a RNN 4-gram model. Combining morph-based modeling and approximation of RNNLM, we were able to achieve 8% relative WER reduction and preserve real-time operation of our conversational telephone speech recognition system. △ Less

Submitted 19 September, 2019; v1 submitted 15 July, 2019; originally announced July 2019.

Comments: 12 pages, 2 figures, accepted for publication at SLSP 2019

Showing 1–4 of 4 results for author: Szaszák, G