-
Unified Modeling of Multi-Domain Multi-Device ASR Systems
Authors:
Soumyajit Mitra,
Swayambhu Nath Ray,
Bharat Padi,
Arunasish Sen,
Raghavendra Bilgi,
Harish Arsikere,
Shalini Ghosh,
Ajay Srinivasamurthy,
Sri Garimella
Abstract:
Modern Automatic Speech Recognition (ASR) systems often use a portfolio of domain-specific models in order to get high accuracy for distinct user utterance types across different devices. In this paper, we propose an innovative approach that integrates the different per-domain per-device models into a unified model, using a combination of domain embedding, domain experts, mixture of experts and ad…
▽ More
Modern Automatic Speech Recognition (ASR) systems often use a portfolio of domain-specific models in order to get high accuracy for distinct user utterance types across different devices. In this paper, we propose an innovative approach that integrates the different per-domain per-device models into a unified model, using a combination of domain embedding, domain experts, mixture of experts and adversarial training. We run careful ablation studies to show the benefit of each of these innovations in contributing to the accuracy of the overall unified model. Experiments show that our proposed unified modeling approach actually outperforms the carefully tuned per-domain models, giving relative gains of up to 10% over a baseline model with negligible increase in the number of parameters.
△ Less
Submitted 13 October, 2022; v1 submitted 13 May, 2022;
originally announced May 2022.
-
Timestam** Documents and Beliefs
Authors:
Swayambhu Nath Ray
Abstract:
Most of the textual information available to us are temporally variable. In a world where information is dynamic, time-stam** them is a very important task. Documents are a good source of information and are used for many tasks like, sentiment analysis, classification of reviews etc. The knowledge of creation date of documents facilitates several tasks like summarization, event extraction, tempo…
▽ More
Most of the textual information available to us are temporally variable. In a world where information is dynamic, time-stam** them is a very important task. Documents are a good source of information and are used for many tasks like, sentiment analysis, classification of reviews etc. The knowledge of creation date of documents facilitates several tasks like summarization, event extraction, temporally focused information extraction etc. Unfortunately, for most of the documents on the web, the time-stamp meta-data is either erroneous or missing. Thus document dating is a challenging problem which requires inference over the temporal structure of the document alongside the contextual information of the document. Prior document dating systems have largely relied on handcrafted features while ignoring such document-internal structures. In this paper we propose NeuralDater, a Graph Convolutional Network (GCN) based document dating approach which jointly exploits syntactic and temporal graph structures of document in a principled way. We also pointed out some limitations of NeuralDater and tried to utilize both context and temporal information in documents in a more flexible and intuitive manner proposing AD3: Attentive Deep Document Dater, an attention-based document dating system. To the best of our knowledge these are the first application of deep learning methods for the task. Through extensive experiments on real-world datasets, we find that our models significantly outperforms state-of-the-art baselines by a significant margin.
△ Less
Submitted 8 June, 2021;
originally announced June 2021.
-
Improving RNN-T ASR Performance with Date-Time and Location Awareness
Authors:
Swayambhu Nath Ray,
Soumyajit Mitra,
Raghavendra Bilgi,
Sri Garimella
Abstract:
In this paper, we explore the benefits of incorporating context into a Recurrent Neural Network (RNN-T) based Automatic Speech Recognition (ASR) model to improve the speech recognition for virtual assistants. Specifically, we use meta information extracted from the time at which the utterance is spoken and the approximate location information to make ASR context aware. We show that these contextua…
▽ More
In this paper, we explore the benefits of incorporating context into a Recurrent Neural Network (RNN-T) based Automatic Speech Recognition (ASR) model to improve the speech recognition for virtual assistants. Specifically, we use meta information extracted from the time at which the utterance is spoken and the approximate location information to make ASR context aware. We show that these contextual information, when used individually, improves overall performance by as much as 3.48% relative to the baseline and when the contexts are combined, the model learns complementary features and the recognition improves by 4.62%. On specific domains, these contextual signals show improvements as high as 11.5%, without any significant degradation on others. We ran experiments with models trained on data of sizes 30K hours and 10K hours. We show that the scale of improvement with the 10K hours dataset is much higher than the one obtained with 30K hours dataset. Our results indicate that with limited data to train the ASR model, contextual signals can improve the performance significantly.
△ Less
Submitted 16 June, 2021; v1 submitted 11 June, 2021;
originally announced June 2021.
-
Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End
Authors:
Swayambhu Nath Ray,
Minhua Wu,
Anirudh Raju,
Pegah Ghahremani,
Raghavendra Bilgi,
Milind Rao,
Harish Arsikere,
Ariya Rastrow,
Andreas Stolcke,
Jasha Droppo
Abstract:
Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to improve a recurrent neural network-transducer (RNN-T) based automatic speech recognition (ASR) system. An audio-to-intent (A2I) model encodes the intent…
▽ More
Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to improve a recurrent neural network-transducer (RNN-T) based automatic speech recognition (ASR) system. An audio-to-intent (A2I) model encodes the intent of the utterance in the form of embeddings or posteriors, and these are used as auxiliary inputs for RNN-T training and inference. Experimenting with a 50k-hour far-field English speech corpus, this study shows that when running the system in non-streaming mode, where intent representation is extracted from the entire utterance and then used to bias streaming RNN-T search from the start, it provides a 5.56% relative word error rate reduction (WERR). On the other hand, a streaming system using per-frame intent posteriors as extra inputs for the RNN-T ASR system yields a 3.33% relative WERR. A further detailed analysis of the streaming system indicates that our proposed method brings especially good gain on media-playing related intents (e.g. 9.12% relative WERR on PlayMusicIntent).
△ Less
Submitted 16 June, 2021; v1 submitted 14 May, 2021;
originally announced May 2021.
-
AD3: Attentive Deep Document Dater
Authors:
Swayambhu Nath Ray,
Shib Sankar Dasgupta,
Partha Talukdar
Abstract:
Knowledge of the creation date of documents facilitates several tasks such as summarization, event extraction, temporally focused information extraction etc. Unfortunately, for most of the documents on the Web, the time-stamp metadata is either missing or can't be trusted. Thus, predicting creation time from document content itself is an important task. In this paper, we propose Attentive Deep Doc…
▽ More
Knowledge of the creation date of documents facilitates several tasks such as summarization, event extraction, temporally focused information extraction etc. Unfortunately, for most of the documents on the Web, the time-stamp metadata is either missing or can't be trusted. Thus, predicting creation time from document content itself is an important task. In this paper, we propose Attentive Deep Document Dater (AD3), an attention-based neural document dating system which utilizes both context and temporal information in documents in a flexible and principled manner. We perform extensive experimentation on multiple real-world datasets to demonstrate the effectiveness of AD3 over neural and non-neural baselines.
△ Less
Submitted 21 January, 2019;
originally announced February 2019.
-
Dating Documents using Graph Convolution Networks
Authors:
Shikhar Vashishth,
Shib Sankar Dasgupta,
Swayambhu Nath Ray,
Partha Talukdar
Abstract:
Document date is essential for many important tasks, such as document retrieval, summarization, event detection, etc. While existing approaches for these tasks assume accurate knowledge of the document date, this is not always available, especially for arbitrary documents from the Web. Document Dating is a challenging problem which requires inference over the temporal structure of the document. Pr…
▽ More
Document date is essential for many important tasks, such as document retrieval, summarization, event detection, etc. While existing approaches for these tasks assume accurate knowledge of the document date, this is not always available, especially for arbitrary documents from the Web. Document Dating is a challenging problem which requires inference over the temporal structure of the document. Prior document dating systems have largely relied on handcrafted features while ignoring such document internal structures. In this paper, we propose NeuralDater, a Graph Convolutional Network (GCN) based document dating approach which jointly exploits syntactic and temporal graph structures of document in a principled way. To the best of our knowledge, this is the first application of deep learning for the problem of document dating. Through extensive experiments on real-world datasets, we find that NeuralDater significantly outperforms state-of-the-art baseline by 19% absolute (45% relative) accuracy points.
△ Less
Submitted 31 January, 2019;
originally announced February 2019.