Search | arXiv e-print repository

Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets

Authors: Lu Zeng, Sree Hari Krishnan Parthasarathi, Yuzong Liu, Alex Escott, Santosh Kumar Cheekatmalla, Nikko Strom, Shiv Vitaladevuni

Abstract: We propose a novel 2-stage sub 8-bit quantization aware training algorithm for all components of a 250K parameter feedforward, streaming, state-free keyword spotting model. For the 1st-stage, we adapt a recently proposed quantization technique using a non-linear transformation with tanh(.) on dense layer weights. In the 2nd-stage, we use linear quantization methods on the rest of the network, incl… ▽ More We propose a novel 2-stage sub 8-bit quantization aware training algorithm for all components of a 250K parameter feedforward, streaming, state-free keyword spotting model. For the 1st-stage, we adapt a recently proposed quantization technique using a non-linear transformation with tanh(.) on dense layer weights. In the 2nd-stage, we use linear quantization methods on the rest of the network, including other parameters (bias, gain, batchnorm), inputs, and activations. We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data (evaluating on 4,000 hours of data). We organize our results in two embedded chipset settings: a) with commodity ARM NEON instruction set and 8-bit containers, we present accuracy, CPU, and memory results using sub 8-bit weights (4, 5, 8-bit) and 8-bit quantization of rest of the network; b) with off-the-shelf neural network accelerators, for a range of weight bit widths (1 and 5-bit), while presenting accuracy results, we project reduction in memory utilization. In both configurations, our results show that the proposed algorithm can achieve: a) parity with a full floating point model's operating point on a detection error tradeoff (DET) curve in terms of false detection rate (FDR) at false rejection rate (FRR); b) significant reduction in compute and memory, yielding up to 3 times improvement in CPU consumption and more than 4 times improvement in memory consumption. △ Less

Submitted 8 September, 2022; v1 submitted 13 July, 2022; originally announced July 2022.

arXiv:2104.09088 [pdf, other]

Alexa Conversations: An Extensible Data-driven Approach for Building Task-oriented Dialogue Systems

Authors: Anish Acharya, Suranjit Adhikari, Sanchit Agarwal, Vincent Auvray, Nehal Belgamwar, Arijit Biswas, Shubhra Chandra, Tagyoung Chung, Maryam Fazel-Zarandi, Raefer Gabriel, Shuyang Gao, Rahul Goel, Dilek Hakkani-Tur, Jan Jezabek, Abhay Jha, Jiun-Yu Kao, Prakash Krishnan, Peter Ku, Anuj Goyal, Chien-Wei Lin, Qing Liu, Arindam Mandal, Angeliki Metallinou, Vishal Naik, Yi Pan , et al. (6 additional authors not shown)

Abstract: Traditional goal-oriented dialogue systems rely on various components such as natural language understanding, dialogue state tracking, policy learning and response generation. Training each component requires annotations which are hard to obtain for every new domain, limiting scalability of such systems. Similarly, rule-based dialogue systems require extensive writing and maintenance of rules and… ▽ More Traditional goal-oriented dialogue systems rely on various components such as natural language understanding, dialogue state tracking, policy learning and response generation. Training each component requires annotations which are hard to obtain for every new domain, limiting scalability of such systems. Similarly, rule-based dialogue systems require extensive writing and maintenance of rules and do not scale either. End-to-End dialogue systems, on the other hand, do not require module-specific annotations but need a large amount of data for training. To overcome these problems, in this demo, we present Alexa Conversations, a new approach for building goal-oriented dialogue systems that is scalable, extensible as well as data efficient. The components of this system are trained in a data-driven manner, but instead of collecting annotated conversations for training, we generate them using a novel dialogue simulator based on a few seed dialogues and specifications of APIs and entities provided by the developer. Our approach provides out-of-the-box support for natural conversational phenomena like entity sharing across turns or users changing their mind during conversation without requiring developers to provide any such dialogue flows. We exemplify our approach using a simple pizza ordering task and showcase its value in reducing the developer burden for creating a robust experience. Finally, we evaluate our system using a typical movie ticket booking task and show that the dialogue simulator is an essential component of the system that leads to over $50\%$ improvement in turn-level action signature prediction accuracy. △ Less

Submitted 19 April, 2021; originally announced April 2021.

Journal ref: NAACL 2021 System Demonstrations Track

arXiv:2104.04281 [pdf, other]

doi 10.1109/MS.2020.3020651

Managing Traceability Information Models: Not such a simple task after all?

Authors: Salome Maro, Jan-Philipp Steghöfer, Eric Knauss, Jennifer Horkoff, Rashidah Kasauli, Rebekka Wohlrab, Jesper Lysemose Korsgaard, Florian Wartenberg, Niels Jørgen Strøm, Ruben Alexandersson

Abstract: Practitioners are poorly supported by the scientific literature when managing traceability information models (TIMs), which capture the structure and semantics of trace links. In practice, companies manage their TIMs in very different ways, even in cases where companies share many similarities. We present our findings from an in-depth focus group about TIM management with three different systems e… ▽ More Practitioners are poorly supported by the scientific literature when managing traceability information models (TIMs), which capture the structure and semantics of trace links. In practice, companies manage their TIMs in very different ways, even in cases where companies share many similarities. We present our findings from an in-depth focus group about TIM management with three different systems engineering companies. We find that the concrete needs of the companies as well as challenges such as scale and workflow integration are not considered by existing scientific work. We thus issue a call-to-arms for the requirements engineering and software and systems traceability communities, the two main communities for traceability research, to refocus their work on these practical problems. △ Less

Submitted 9 April, 2021; originally announced April 2021.

Comments: ©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Journal ref: IEEE Software (2020)

arXiv:1904.10584 [pdf, other]

doi 10.1109/JETCAS.2019.2912353

Realizing Petabyte Scale Acoustic Modeling

Authors: Sree Hari Krishnan Parthasarathi, Nitin Sivakrishnan, Pranav Ladkat, Nikko Strom

Abstract: Large scale machine learning (ML) systems such as the Alexa automatic speech recognition (ASR) system continue to improve with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic models (AM) from the vast firehose of untranscribed audio data. Learning an AM from 1 Millio… ▽ More Large scale machine learning (ML) systems such as the Alexa automatic speech recognition (ASR) system continue to improve with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic models (AM) from the vast firehose of untranscribed audio data. Learning an AM from 1 Million hours of audio presents unique ML and system design challenges. We present the design and evaluation of a highly scalable and resource efficient SSL system for AM. Employing the student/teacher learning paradigm, we focus on the student learning subsystem: a scalable and robust data pipeline that generates features and targets from raw audio, and an efficient model pipeline, including the distributed trainer, that builds a student model. Our evaluations show that, even without extensive hyper-parameter tuning, we obtain relative accuracy improvements in the 10 to 20$\%$ range, with higher gains in noisier conditions. The end-to-end processing time of this SSL system was 12 days, and several components in this system can trivially scale linearly with more compute resources. △ Less

Submitted 23 April, 2019; originally announced April 2019.

Comments: 2156-3357 ©2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information

arXiv:1904.01624 [pdf, ps, other]

Lessons from Building Acoustic Models with a Million Hours of Speech

Authors: Sree Hari Krishnan Parthasarathi, Nikko Strom

Abstract: This is a report of our lessons learned building acoustic models from 1 Million hours of unlabeled speech, while labeled speech is restricted to 7,000 hours. We employ student/teacher training on unlabeled data, hel** scale out target generation in comparison to confidence model based methods, which require a decoder and a confidence model. To optimize storage and to parallelize target generatio… ▽ More This is a report of our lessons learned building acoustic models from 1 Million hours of unlabeled speech, while labeled speech is restricted to 7,000 hours. We employ student/teacher training on unlabeled data, hel** scale out target generation in comparison to confidence model based methods, which require a decoder and a confidence model. To optimize storage and to parallelize target generation, we store high valued logits from the teacher model. Introducing the notion of scheduled learning, we interleave learning on unlabeled and labeled data. To scale distributed training across a large number of GPUs, we use BMUF with 64 GPUs, while performing sequence training only on labeled data with gradient threshold compression SGD using 16 GPUs. Our experiments show that extremely large amounts of data are indeed useful; with little hyper-parameter tuning, we obtain relative WER improvements in the 10 to 20% range, with higher gains in noisier conditions. △ Less

Submitted 2 April, 2019; originally announced April 2019.

Comments: "Copyright 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works."

arXiv:1903.06539 [pdf, other]

doi 10.1109/ICASSP.2019.8682294

Multi-Geometry Spatial Acoustic Modeling for Distant Speech Recognition

Authors: Kenichi Kumatani, Minhua Wu, Shiva Sundaram, Nikko Strom, Bjorn Hoffmeister

Abstract: The use of spatial information with multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade speech enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such speech enhancement techniques do not always yield ASR accuracy improvement due to the differenc… ▽ More The use of spatial information with multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade speech enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such speech enhancement techniques do not always yield ASR accuracy improvement due to the difference between speech enhancement and ASR optimization objectives. In this work, we propose to unify an acoustic model framework by optimizing spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input. Our acoustic model subsumes beamformers with multiple types of array geometry. In contrast to deep clustering methods that treat a neural network as a black box tool, the network encoding the spatial filters can process streaming audio data in real time without the accumulation of target signal statistics. We demonstrate the effectiveness of such MC neural networks through ASR experiments on the real-world far-field data. We show that our two-channel acoustic model can on average reduce word error rates (WERs) by~13.4 and~12.7% compared to a single channel ASR system with the log-mel filter bank energy (LFBE) feature under the matched and mismatched microphone placement conditions, respectively. Our result also shows that our two-channel network achieves a relative WER reduction of over~7.0% compared to conventional beamforming with seven microphones overall. △ Less

Submitted 28 April, 2019; v1 submitted 12 March, 2019; originally announced March 2019.

Comments: ICASSP2019, 5 pages. arXiv admin note: substantial text overlap with arXiv:1903.05299

Report number: https://doi.org/10.1109/ICASSP.2019.8682294

Journal ref: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019, page 6635-6639

arXiv:1903.05299 [pdf, other]

doi 10.1109/ICASSP.2019.8682977

Frequency Domain Multi-channel Acoustic Modeling for Distant Speech Recognition

Authors: Minhua Wu, Kenichi Kumatani, Shiva Sundaram, Nikko Strom, Bjorn Hoffmeister

Abstract: Conventional far-field automatic speech recognition (ASR) systems typically employ microphone array techniques for speech enhancement in order to improve robustness against noise or reverberation. However, such speech enhancement techniques do not always yield ASR accuracy improvement because the optimization criterion for speech enhancement is not directly relevant to the ASR objective. In this w… ▽ More Conventional far-field automatic speech recognition (ASR) systems typically employ microphone array techniques for speech enhancement in order to improve robustness against noise or reverberation. However, such speech enhancement techniques do not always yield ASR accuracy improvement because the optimization criterion for speech enhancement is not directly relevant to the ASR objective. In this work, we develop new acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based on an ASR criterion directly. In contrast to conventional methods, we incorporate array processing knowledge into the acoustic model. Moreover, we initialize the network with beamformers' coefficients. We investigate effects of such MC neural networks through ASR experiments on the real-world far-field data where users are interacting with an ASR system in uncontrolled acoustic environments. We show that our MC acoustic model can reduce a word error rate (WER) by~16.5\% compared to a single channel ASR system with the traditional log-mel filter bank energy (LFBE) feature on average. Our result also shows that our network with the spatial filtering layer on two-channel input achieves a relative WER reduction of~9.5\% compared to conventional beamforming with seven microphones. △ Less

Submitted 28 April, 2019; v1 submitted 12 March, 2019; originally announced March 2019.

Comments: ICASSP 2019, 5 pages

Report number: https://doi.org/10.1109/ICASSP.2019.8682977

Journal ref: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019, pages 6640-6644

arXiv:1811.06296 [pdf, other]

Comprehensive evaluation of statistical speech waveform synthesis

Authors: Thomas Merritt, Bartosz Putrycz, Adam Nadolski, Tianjun Ye, Daniel Korzekwa, Wiktor Dolecki, Thomas Drugman, Viacheslav Klimkov, Alexis Moinet, Andrew Breen, Rafal Kuklinski, Nikko Strom, Roberto Barra-Chicote

Abstract: Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency in quality. The results of this evaluation are validated by repeating the pro… ▽ More Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency in quality. The results of this evaluation are validated by repeating the procedure on a separate group of testers. Finally, an analysis of the nature of speech errors of SSWS compared to hybrid unit selection synthesis is conducted to identify the strengths and weaknesses of SSWS. Having a deeper insight into SSWS allows us to better define the focus of future work to improve this new technology. △ Less

Submitted 11 December, 2018; v1 submitted 15 November, 2018; originally announced November 2018.

arXiv:1808.00563 [pdf, other]

Data Augmentation for Robust Keyword Spotting under Playback Interference

Authors: Anirudh Raju, Sankaran Panchapagesan, Xing Liu, Arindam Mandal, Nikko Strom

Abstract: Accurate on-device keyword spotting (KWS) with low false accept and false reject rate is crucial to customer experience for far-field voice control of conversational agents. It is particularly challenging to maintain low false reject rate in real world conditions where there is (a) ambient noise from external sources such as TV, household appliances, or other speech that is not directed at the dev… ▽ More Accurate on-device keyword spotting (KWS) with low false accept and false reject rate is crucial to customer experience for far-field voice control of conversational agents. It is particularly challenging to maintain low false reject rate in real world conditions where there is (a) ambient noise from external sources such as TV, household appliances, or other speech that is not directed at the device (b) imperfect cancellation of the audio playback from the device, resulting in residual echo, after being processed by the Acoustic Echo Cancellation (AEC) system. In this paper, we propose a data augmentation strategy to improve keyword spotting performance under these challenging conditions. The training set audio is artificially corrupted by mixing in music and TV/movie audio, at different signal to interference ratios. Our results show that we get around 30-45% relative reduction in false reject rates, at a range of false alarm rates, under audio playback from such devices. △ Less

Submitted 1 August, 2018; originally announced August 2018.

arXiv:1705.02411 [pdf, other]

doi 10.1109/SLT.2016.7846306

Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting

Authors: Ming Sun, Anirudh Raju, George Tucker, Sankaran Panchapagesan, Gengshen Fu, Arindam Mandal, Spyros Matsoukas, Nikko Strom, Shiv Vitaladevuni

Abstract: We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy loss trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance.… ▽ More We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy loss trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using cross-entropy loss or max-pooling loss outperform a cross-entropy loss trained baseline feed-forward Deep Neural Network (DNN). In addition, max-pooling loss trained LSTM with randomly initialized network performs better compared to cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, which yields $67.6\%$ relative reduction compared to baseline feed-forward DNN in Area Under the Curve (AUC) measure. △ Less

Submitted 5 May, 2017; originally announced May 2017.

Journal ref: Spoken Language Technology Workshop (SLT), 2016 IEEE (pp. 474-480). IEEE

Showing 1–10 of 10 results for author: Strom, N