
doi 10.1109/SLT54892.2023.10022592

Accelerator-Aware Training for Transducer-Based Speech Recognition

Authors: Suhaila M. Shakiah, Rupak Vignesh Swaminathan, Hieu Duy Nguyen, Raviteja Chinta, Tariq Afzal, Nathan Susanj, Athanasios Mouchtaris, Grant P. Strimel, Ariya Rastrow

Abstract: Machine learning model weights and activations are represented in full-precision during training. This leads to performance degradation in runtime when deployed on neural network accelerator (NNA) chips, which leverage highly parallelized fixed-point arithmetic to improve runtime memory and latency. In this work, we replicate the NNA operators during the training phase, accounting for the degradation due to low-precision inference on the NNA in back-propagation. Our proposed method efficiently emulates NNA operations, thus foregoing the need to transfer quantization error-prone data to the Central Processing Unit (CPU), ultimately reducing the user perceived latency (UPL). We apply our approach to Recurrent Neural Network-Transducer (RNN-T), an attractive architecture for on-device streaming speech recognition tasks. We train and evaluate models on 270K hours of English data and show a 5-7% improvement in engine latency while saving up to 10% relative degradation in WER.

Submitted 12 May, 2023; originally announced May 2023.

Comments: Accepted to SLT 2022

Journal ref: IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 2023, pp. 100-107

arXiv:2303.00692 [pdf, other]

Leveraging Redundancy in Multiple Audio Signals for Far-Field Speech Recognition

Authors: Feng-Ju Chang, Anastasios Alexandridis, Rupak Vignesh Swaminathan, Martin Radfar, Harish Mallidi, Maurizio Omologo, Athanasios Mouchtaris, Brian King, Roland Maas

Abstract: To achieve robust far-field automatic speech recognition (ASR), existing techniques typically employ an acoustic front end (AFE) cascaded with a neural transducer (NT) ASR model. The AFE output, however, could be unreliable, as the beamforming output in AFE is steered to a wrong direction. A promising way to address this issue is to exploit the microphone signals before the beamforming stage and after the acoustic echo cancellation (post-AEC) in AFE. We argue that both, post-AEC and AFE outputs, are complementary and it is possible to leverage the redundancy between these signals to compensate for potential AFE processing errors. We present two fusion networks to explore this redundancy and aggregate these multi-channel (MC) signals: (1) Frequency-LSTM based, and (2) Convolutional Neural Network based fusion networks. We augment the MC fusion networks to a conformer transducer model and train it in an end-to-end fashion. Our experimental results on commercial virtual assistant tasks demonstrate that using the AFE output and two post-AEC signals with fusion networks offers up to 25.9% word error rate (WER) relative improvement over the model using the AFE output only, at the cost of <= 2% parameter increase.

Submitted 1 March, 2023; originally announced March 2023.

arXiv:2210.16238 [pdf, ps, other]

Contextual-Utterance Training for Automatic Speech Recognition

Authors: Alejandro Gomez-Alanis, Lukas Drude, Andreas Schwarz, Rupak Vignesh Swaminathan, Simon Wiesler

Abstract: Recent studies of streaming automatic speech recognition (ASR) recurrent neural network transducer (RNN-T)-based systems have fed the encoder with past contextual information in order to improve its word error rate (WER) performance. In this paper, we first propose a contextual-utterance training technique which makes use of the previous and future contextual utterances in order to do an implicit adaptation to the speaker, topic and acoustic environment. Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems. This proposed approach allows to make a better use of the available acoustic context in streaming models by distilling "in-place" the knowledge of a teacher, which is able to see both past and future contextual utterances, to the student which can only see the current and past contextual utterances. The experimental results show that a conformer-transducer system trained with the proposed techniques outperforms the same system trained with the classical RNN-T loss. Specifically, the proposed technique is able to reduce both the WER and the average last token emission latency by more than 6% and 40ms relative, respectively.

Submitted 27 October, 2022; originally announced October 2022.

arXiv:2209.14868 [pdf, other]

ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition

Authors: Martin Radfar, Rohit Barnwal, Rupak Vignesh Swaminathan, Feng-Ju Chang, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris

Abstract: The recurrent neural network transducer (RNN-T) is a prominent streaming end-to-end (E2E) ASR technology. In RNN-T, the acoustic encoder commonly consists of stacks of LSTMs. Very recently, as an alternative to LSTM layers, the Conformer architecture was introduced where the encoder of RNN-T is replaced with a modified Transformer encoder composed of convolutional layers at the frontend and between attention layers. In this paper, we introduce a new streaming ASR model, Convolutional Augmented Recurrent Neural Network Transducers (ConvRNN-T) in which we augment the LSTM-based RNN-T with a novel convolutional frontend consisting of local and global context CNN encoders. ConvRNN-T takes advantage of causal 1-D convolutional layers, squeeze-and-excitation, dilation, and residual blocks to provide both global and local audio context representation to LSTM layers. We show ConvRNN-T outperforms RNN-T, Conformer, and ContextNet on Librispeech and in-house data. In addition, ConvRNN-T offers less computational complexity compared to Conformer. ConvRNN-T's superior accuracy along with its low footprint make it a promising candidate for on-device streaming ASR technologies.

Submitted 29 September, 2022; originally announced September 2022.

Comments: This paper was presented in Interspeech 2022

arXiv:2106.07734 [pdf, other]

CoDERT: Distilling Encoder Representations with Co-learning for Transducer-based Speech Recognition

Authors: Rupak Vignesh Swaminathan, Brian King, Grant P. Strimel, Jasha Droppo, Athanasios Mouchtaris

Abstract: We propose a simple yet effective method to compress an RNN-Transducer (RNN-T) through the well-known knowledge distillation paradigm. We show that the transducer's encoder outputs naturally have a high entropy and contain rich information about acoustically similar word-piece confusions. This rich information is suppressed when combined with the lower entropy decoder outputs to produce the joint network logits. Consequently, we introduce an auxiliary loss to distill the encoder logits from a teacher transducer's encoder, and explore training strategies where this encoder distillation works effectively. We find that tandem training of teacher and student encoders with an inplace encoder distillation outperforms the use of a pre-trained and static teacher transducer. We also report an interesting phenomenon we refer to as implicit distillation, that occurs when the teacher and student encoders share the same decoder. Our experiments show 5.37-8.4% relative word error rate reductions (WERR) on in-house test sets, and 5.05-6.18% relative WERRs on LibriSpeech test sets.

Submitted 14 June, 2021; originally announced June 2021.

Comments: Accepted at InterSpeech 2021

arXiv:2106.06126 [pdf, other]

Exploiting Large-scale Teacher-Student Training for On-device Acoustic Models

Authors: Jing Liu, Rupak Vignesh Swaminathan, Sree Hari Krishnan Parthasarathi, Chunchuan Lyu, Athanasios Mouchtaris, Siegfried Kunzmann

Abstract: We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM) with experiments spanning over 3000 hours of GPU time, making our study one of the largest of its kind. We discuss SSL for AMs in a small footprint setting, showing that a smaller capacity model trained with 1 million hours of unsupervised data can outperform a baseline supervised system by 14.3% word error rate reduction (WERR). When increasing the supervised data to seven-fold, our gains diminish to 7.1% WERR; to improve SSL efficiency at larger supervised data regimes, we employ a step-wise distillation into a smaller model, obtaining a WERR of 14.4%. We then switch to SSL using larger student models in low data regimes; while learning efficiency with unsupervised data is higher, student models may outperform teacher models in such a setting. We develop a theoretical sketch to explain this behavior.

Submitted 10 June, 2021; originally announced June 2021.

Comments: TSD2021

arXiv:1809.08671 [pdf, other]

Jovian vortices and jets

Authors: Glenn R. Flierl, Philip J. Morrison, Rohith Vilasur Swaminathan

Abstract: We explore the conditions required for isolated vortices to exist in sheared zonal flows and the stability of the underlying zonal winds. This is done using the standard 2-layer quasigeostrophic model with the lower layer depth becoming infinite; however, this model differs from the usual layer model because the lower layer is not assumed to be motionless but has a steady configuration of alternating zonal flows [1]. Steady state vortices are obtained by a simulated annealing computational method introduced in [2], generalized and applied in [3] in fluid flow, and used in the context of magnetohydrodynamics in [4-6]. Various cases of vortices with a constant potential vorticity anomaly atop zonal winds and the stability of the underlying winds are considered using a mix of computational and analytical techniques.

Submitted 23 September, 2018; originally announced September 2018.

arXiv:1506.05227 [pdf, other]

doi 10.1103/PhysRevE.94.013105

Dynamics of circular arrangements of vorticity in two dimensions

Authors: Rohith V. Swaminathan, S. Ravichandran, Prasad Perlekar, Rama Govindarajan

Abstract: The merger of two like-signed vortices is a well-studied problem, but in a turbulent flow, we may often have more than two like-signed vortices interacting. We study the merger of three or more identical co-rotating vortices initially arranged on the vertices of a regular polygon. At low to moderate Reynolds numbers, we find an additional stage in the merger process, absent in the merger of two vortices, where an annular vortical structure is formed and is long-lived. Vortex merger is slowed down significantly due to this. Such annular vortices are known at far higher Reynolds numbers in studies of tropical cyclones, which have been noticed to break down into individual vortices. In the pre-annular stage, vortical structures in a viscous flow are found here to tilt and realign in a manner similar to the inviscid case, but the pronounced filaments visible in the latter are practically absent in the former. Interestingly at higher Reynolds numbers, the merger of an odd number of vortices is found to proceed very differently from that of an even number. The former process is rapid and chaotic whereas the latter proceeds more slowly via pairing events. The annular vortex takes the form of a generalised Lamb-Oseen vortex (GLO), and diffuses inwards until it forms a standard Lamb-Oseen vortex. For lower Reynolds number, the numerical (fully nonlinear) evolution of the GLO vortex follows exactly the analytical evolution until merger. At higher Reynolds numbers, the annulus goes through instabilities whose nonlinear stages show a pronounced difference between even and odd mode disturbances. It is hoped that the present findings, that multiple vortex merger is qualitatively different from the merger of two vortices, will motivate studies on how multiple vortex interactions affect the inverse cascade in two-dimensional turbulence.

Submitted 21 June, 2016; v1 submitted 17 June, 2015; originally announced June 2015.

Comments: Abstract truncated. Paper to appear in Physical Review E

Journal ref: Phys. Rev. E 94, 013105 (2016)

Showing 1–8 of 8 results for author: Swaminathan, R V