-
Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets
Authors:
Lu Zeng,
Sree Hari Krishnan Parthasarathi,
Yuzong Liu,
Alex Escott,
Santosh Kumar Cheekatmalla,
Nikko Strom,
Shiv Vitaladevuni
Abstract:
We propose a novel 2-stage sub 8-bit quantization aware training algorithm for all components of a 250K parameter feedforward, streaming, state-free keyword spotting model. For the 1st-stage, we adapt a recently proposed quantization technique using a non-linear transformation with tanh(.) on dense layer weights. In the 2nd-stage, we use linear quantization methods on the rest of the network, incl…
▽ More
We propose a novel 2-stage sub 8-bit quantization aware training algorithm for all components of a 250K parameter feedforward, streaming, state-free keyword spotting model. For the 1st-stage, we adapt a recently proposed quantization technique using a non-linear transformation with tanh(.) on dense layer weights. In the 2nd-stage, we use linear quantization methods on the rest of the network, including other parameters (bias, gain, batchnorm), inputs, and activations. We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data (evaluating on 4,000 hours of data). We organize our results in two embedded chipset settings: a) with commodity ARM NEON instruction set and 8-bit containers, we present accuracy, CPU, and memory results using sub 8-bit weights (4, 5, 8-bit) and 8-bit quantization of rest of the network; b) with off-the-shelf neural network accelerators, for a range of weight bit widths (1 and 5-bit), while presenting accuracy results, we project reduction in memory utilization. In both configurations, our results show that the proposed algorithm can achieve: a) parity with a full floating point model's operating point on a detection error tradeoff (DET) curve in terms of false detection rate (FDR) at false rejection rate (FRR); b) significant reduction in compute and memory, yielding up to 3 times improvement in CPU consumption and more than 4 times improvement in memory consumption.
△ Less
Submitted 8 September, 2022; v1 submitted 13 July, 2022;
originally announced July 2022.
-
Alexa Conversations: An Extensible Data-driven Approach for Building Task-oriented Dialogue Systems
Authors:
Anish Acharya,
Suranjit Adhikari,
Sanchit Agarwal,
Vincent Auvray,
Nehal Belgamwar,
Arijit Biswas,
Shubhra Chandra,
Tagyoung Chung,
Maryam Fazel-Zarandi,
Raefer Gabriel,
Shuyang Gao,
Rahul Goel,
Dilek Hakkani-Tur,
Jan Jezabek,
Abhay Jha,
Jiun-Yu Kao,
Prakash Krishnan,
Peter Ku,
Anuj Goyal,
Chien-Wei Lin,
Qing Liu,
Arindam Mandal,
Angeliki Metallinou,
Vishal Naik,
Yi Pan
, et al. (6 additional authors not shown)
Abstract:
Traditional goal-oriented dialogue systems rely on various components such as natural language understanding, dialogue state tracking, policy learning and response generation. Training each component requires annotations which are hard to obtain for every new domain, limiting scalability of such systems. Similarly, rule-based dialogue systems require extensive writing and maintenance of rules and…
▽ More
Traditional goal-oriented dialogue systems rely on various components such as natural language understanding, dialogue state tracking, policy learning and response generation. Training each component requires annotations which are hard to obtain for every new domain, limiting scalability of such systems. Similarly, rule-based dialogue systems require extensive writing and maintenance of rules and do not scale either. End-to-End dialogue systems, on the other hand, do not require module-specific annotations but need a large amount of data for training. To overcome these problems, in this demo, we present Alexa Conversations, a new approach for building goal-oriented dialogue systems that is scalable, extensible as well as data efficient. The components of this system are trained in a data-driven manner, but instead of collecting annotated conversations for training, we generate them using a novel dialogue simulator based on a few seed dialogues and specifications of APIs and entities provided by the developer. Our approach provides out-of-the-box support for natural conversational phenomena like entity sharing across turns or users changing their mind during conversation without requiring developers to provide any such dialogue flows. We exemplify our approach using a simple pizza ordering task and showcase its value in reducing the developer burden for creating a robust experience. Finally, we evaluate our system using a typical movie ticket booking task and show that the dialogue simulator is an essential component of the system that leads to over $50\%$ improvement in turn-level action signature prediction accuracy.
△ Less
Submitted 19 April, 2021;
originally announced April 2021.
-
Managing Traceability Information Models: Not such a simple task after all?
Authors:
Salome Maro,
Jan-Philipp Steghöfer,
Eric Knauss,
Jennifer Horkoff,
Rashidah Kasauli,
Rebekka Wohlrab,
Jesper Lysemose Korsgaard,
Florian Wartenberg,
Niels Jørgen Strøm,
Ruben Alexandersson
Abstract:
Practitioners are poorly supported by the scientific literature when managing traceability information models (TIMs), which capture the structure and semantics of trace links. In practice, companies manage their TIMs in very different ways, even in cases where companies share many similarities. We present our findings from an in-depth focus group about TIM management with three different systems e…
▽ More
Practitioners are poorly supported by the scientific literature when managing traceability information models (TIMs), which capture the structure and semantics of trace links. In practice, companies manage their TIMs in very different ways, even in cases where companies share many similarities. We present our findings from an in-depth focus group about TIM management with three different systems engineering companies. We find that the concrete needs of the companies as well as challenges such as scale and workflow integration are not considered by existing scientific work. We thus issue a call-to-arms for the requirements engineering and software and systems traceability communities, the two main communities for traceability research, to refocus their work on these practical problems.
△ Less
Submitted 9 April, 2021;
originally announced April 2021.
-
Realizing Petabyte Scale Acoustic Modeling
Authors:
Sree Hari Krishnan Parthasarathi,
Nitin Sivakrishnan,
Pranav Ladkat,
Nikko Strom
Abstract:
Large scale machine learning (ML) systems such as the Alexa automatic speech recognition (ASR) system continue to improve with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic models (AM) from the vast firehose of untranscribed audio data. Learning an AM from 1 Millio…
▽ More
Large scale machine learning (ML) systems such as the Alexa automatic speech recognition (ASR) system continue to improve with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic models (AM) from the vast firehose of untranscribed audio data. Learning an AM from 1 Million hours of audio presents unique ML and system design challenges. We present the design and evaluation of a highly scalable and resource efficient SSL system for AM. Employing the student/teacher learning paradigm, we focus on the student learning subsystem: a scalable and robust data pipeline that generates features and targets from raw audio, and an efficient model pipeline, including the distributed trainer, that builds a student model. Our evaluations show that, even without extensive hyper-parameter tuning, we obtain relative accuracy improvements in the 10 to 20$\%$ range, with higher gains in noisier conditions. The end-to-end processing time of this SSL system was 12 days, and several components in this system can trivially scale linearly with more compute resources.
△ Less
Submitted 23 April, 2019;
originally announced April 2019.
-
Lessons from Building Acoustic Models with a Million Hours of Speech
Authors:
Sree Hari Krishnan Parthasarathi,
Nikko Strom
Abstract:
This is a report of our lessons learned building acoustic models from 1 Million hours of unlabeled speech, while labeled speech is restricted to 7,000 hours. We employ student/teacher training on unlabeled data, hel** scale out target generation in comparison to confidence model based methods, which require a decoder and a confidence model. To optimize storage and to parallelize target generatio…
▽ More
This is a report of our lessons learned building acoustic models from 1 Million hours of unlabeled speech, while labeled speech is restricted to 7,000 hours. We employ student/teacher training on unlabeled data, hel** scale out target generation in comparison to confidence model based methods, which require a decoder and a confidence model. To optimize storage and to parallelize target generation, we store high valued logits from the teacher model. Introducing the notion of scheduled learning, we interleave learning on unlabeled and labeled data. To scale distributed training across a large number of GPUs, we use BMUF with 64 GPUs, while performing sequence training only on labeled data with gradient threshold compression SGD using 16 GPUs. Our experiments show that extremely large amounts of data are indeed useful; with little hyper-parameter tuning, we obtain relative WER improvements in the 10 to 20% range, with higher gains in noisier conditions.
△ Less
Submitted 2 April, 2019;
originally announced April 2019.
-
Multi-Geometry Spatial Acoustic Modeling for Distant Speech Recognition
Authors:
Kenichi Kumatani,
Minhua Wu,
Shiva Sundaram,
Nikko Strom,
Bjorn Hoffmeister
Abstract:
The use of spatial information with multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade speech enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such speech enhancement techniques do not always yield ASR accuracy improvement due to the differenc…
▽ More
The use of spatial information with multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade speech enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such speech enhancement techniques do not always yield ASR accuracy improvement due to the difference between speech enhancement and ASR optimization objectives. In this work, we propose to unify an acoustic model framework by optimizing spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input. Our acoustic model subsumes beamformers with multiple types of array geometry. In contrast to deep clustering methods that treat a neural network as a black box tool, the network encoding the spatial filters can process streaming audio data in real time without the accumulation of target signal statistics. We demonstrate the effectiveness of such MC neural networks through ASR experiments on the real-world far-field data. We show that our two-channel acoustic model can on average reduce word error rates (WERs) by~13.4 and~12.7% compared to a single channel ASR system with the log-mel filter bank energy (LFBE) feature under the matched and mismatched microphone placement conditions, respectively. Our result also shows that our two-channel network achieves a relative WER reduction of over~7.0% compared to conventional beamforming with seven microphones overall.
△ Less
Submitted 28 April, 2019; v1 submitted 12 March, 2019;
originally announced March 2019.
-
Frequency Domain Multi-channel Acoustic Modeling for Distant Speech Recognition
Authors:
Minhua Wu,
Kenichi Kumatani,
Shiva Sundaram,
Nikko Strom,
Bjorn Hoffmeister
Abstract:
Conventional far-field automatic speech recognition (ASR) systems typically employ microphone array techniques for speech enhancement in order to improve robustness against noise or reverberation. However, such speech enhancement techniques do not always yield ASR accuracy improvement because the optimization criterion for speech enhancement is not directly relevant to the ASR objective. In this w…
▽ More
Conventional far-field automatic speech recognition (ASR) systems typically employ microphone array techniques for speech enhancement in order to improve robustness against noise or reverberation. However, such speech enhancement techniques do not always yield ASR accuracy improvement because the optimization criterion for speech enhancement is not directly relevant to the ASR objective. In this work, we develop new acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based on an ASR criterion directly. In contrast to conventional methods, we incorporate array processing knowledge into the acoustic model. Moreover, we initialize the network with beamformers' coefficients. We investigate effects of such MC neural networks through ASR experiments on the real-world far-field data where users are interacting with an ASR system in uncontrolled acoustic environments. We show that our MC acoustic model can reduce a word error rate (WER) by~16.5\% compared to a single channel ASR system with the traditional log-mel filter bank energy (LFBE) feature on average. Our result also shows that our network with the spatial filtering layer on two-channel input achieves a relative WER reduction of~9.5\% compared to conventional beamforming with seven microphones.
△ Less
Submitted 28 April, 2019; v1 submitted 12 March, 2019;
originally announced March 2019.
-
Comprehensive evaluation of statistical speech waveform synthesis
Authors:
Thomas Merritt,
Bartosz Putrycz,
Adam Nadolski,
Tianjun Ye,
Daniel Korzekwa,
Wiktor Dolecki,
Thomas Drugman,
Viacheslav Klimkov,
Alexis Moinet,
Andrew Breen,
Rafal Kuklinski,
Nikko Strom,
Roberto Barra-Chicote
Abstract:
Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency in quality. The results of this evaluation are validated by repeating the pro…
▽ More
Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency in quality. The results of this evaluation are validated by repeating the procedure on a separate group of testers. Finally, an analysis of the nature of speech errors of SSWS compared to hybrid unit selection synthesis is conducted to identify the strengths and weaknesses of SSWS. Having a deeper insight into SSWS allows us to better define the focus of future work to improve this new technology.
△ Less
Submitted 11 December, 2018; v1 submitted 15 November, 2018;
originally announced November 2018.
-
Data Augmentation for Robust Keyword Spotting under Playback Interference
Authors:
Anirudh Raju,
Sankaran Panchapagesan,
Xing Liu,
Arindam Mandal,
Nikko Strom
Abstract:
Accurate on-device keyword spotting (KWS) with low false accept and false reject rate is crucial to customer experience for far-field voice control of conversational agents. It is particularly challenging to maintain low false reject rate in real world conditions where there is (a) ambient noise from external sources such as TV, household appliances, or other speech that is not directed at the dev…
▽ More
Accurate on-device keyword spotting (KWS) with low false accept and false reject rate is crucial to customer experience for far-field voice control of conversational agents. It is particularly challenging to maintain low false reject rate in real world conditions where there is (a) ambient noise from external sources such as TV, household appliances, or other speech that is not directed at the device (b) imperfect cancellation of the audio playback from the device, resulting in residual echo, after being processed by the Acoustic Echo Cancellation (AEC) system. In this paper, we propose a data augmentation strategy to improve keyword spotting performance under these challenging conditions. The training set audio is artificially corrupted by mixing in music and TV/movie audio, at different signal to interference ratios. Our results show that we get around 30-45% relative reduction in false reject rates, at a range of false alarm rates, under audio playback from such devices.
△ Less
Submitted 1 August, 2018;
originally announced August 2018.
-
Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting
Authors:
Ming Sun,
Anirudh Raju,
George Tucker,
Sankaran Panchapagesan,
Gengshen Fu,
Arindam Mandal,
Spyros Matsoukas,
Nikko Strom,
Shiv Vitaladevuni
Abstract:
We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy loss trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance.…
▽ More
We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy loss trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using cross-entropy loss or max-pooling loss outperform a cross-entropy loss trained baseline feed-forward Deep Neural Network (DNN). In addition, max-pooling loss trained LSTM with randomly initialized network performs better compared to cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, which yields $67.6\%$ relative reduction compared to baseline feed-forward DNN in Area Under the Curve (AUC) measure.
△ Less
Submitted 5 May, 2017;
originally announced May 2017.