-
End-to-End real time tracking of children's reading with pointer network
Authors:
Vishal Sunder,
Beulah Karrolla,
Eric Fosler-Lussier
Abstract:
In this work, we explore how a real time reading tracker can be built efficiently for children's voices. While previously proposed reading trackers focused on ASR-based cascaded approaches, we propose a fully end-to-end model making it less prone to lags in voice tracking. We employ a pointer network that directly learns to predict positions in the ground truth text conditioned on the streaming sp…
▽ More
In this work, we explore how a real time reading tracker can be built efficiently for children's voices. While previously proposed reading trackers focused on ASR-based cascaded approaches, we propose a fully end-to-end model making it less prone to lags in voice tracking. We employ a pointer network that directly learns to predict positions in the ground truth text conditioned on the streaming speech. To train this pointer network, we generate ground truth training signals by using forced alignment between the read speech and the text being read on the training set. Exploring different forced alignment models, we find a neural attention based model is at least as close in alignment accuracy to the Montreal Forced Aligner, but surprisingly is a better training signal for the pointer network. Our results are reported on one adult speech data (TIMIT) and two children's speech datasets (CMU Kids and Reading Races). Our best model can accurately track adult speech with 87.8% accuracy and the much harder and disfluent children's speech with 77.1% accuracy on CMU Kids data and a 65.3% accuracy on the Reading Races dataset.
△ Less
Submitted 17 October, 2023;
originally announced October 2023.
-
Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems
Authors:
Vishal Sunder,
Eric Fosler-Lussier,
Samuel Thomas,
Hong-Kwang J. Kuo,
Brian Kingsbury
Abstract:
Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) have been primarily due to effective pretraining of speech representations. One such pretraining paradigm is the distillation of semantic knowledge from state-of-the-art text-based models like BERT to speech encoder neural networks. This work is a step towards doing the same in a much more efficient and fine-grained manner whe…
▽ More
Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) have been primarily due to effective pretraining of speech representations. One such pretraining paradigm is the distillation of semantic knowledge from state-of-the-art text-based models like BERT to speech encoder neural networks. This work is a step towards doing the same in a much more efficient and fine-grained manner where we align speech embeddings and BERT embeddings on a token-by-token basis. We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder such that these can be directly compared and aligned with BERT based contextual embeddings. This alignment is performed using a novel tokenwise contrastive loss. Fine-tuning such a pretrained model to perform intent recognition using speech directly yields state-of-the-art performance on two widely used SLU datasets. Our model improves further when fine-tuned with additional regularization using SpecAugment especially when speech is noisy, giving an absolute improvement as high as 8% over previous results.
△ Less
Submitted 1 July, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
Building an ASR Error Robust Spoken Virtual Patient System in a Highly Class-Imbalanced Scenario Without Speech Data
Authors:
Vishal Sunder,
Prashant Serai,
Eric Fosler-Lussier
Abstract:
A Virtual Patient (VP) is a powerful tool for training medical students to take patient histories, where responding to a diverse set of spoken questions is essential to simulate natural conversations with a student. The performance of such a Spoken Language Understanding system (SLU) can be adversely affected by both the presence of Automatic Speech Recognition (ASR) errors in the test data and a…
▽ More
A Virtual Patient (VP) is a powerful tool for training medical students to take patient histories, where responding to a diverse set of spoken questions is essential to simulate natural conversations with a student. The performance of such a Spoken Language Understanding system (SLU) can be adversely affected by both the presence of Automatic Speech Recognition (ASR) errors in the test data and a high degree of class imbalance in the SLU training data. While these two issues have been addressed separately in prior work, we develop a novel two-step training methodology that tackles both these issues effectively in a single dialog agent. As it is difficult to collect spoken data from users without a functioning SLU system, our method does not rely on spoken data for training, rather we use an ASR error predictor to "speechify" the text data. Our method shows significant improvements over strong baselines on the VP intent classification task at various word error rate settings.
△ Less
Submitted 1 July, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
Towards End-to-End Integration of Dialog History for Improved Spoken Language Understanding
Authors:
Vishal Sunder,
Samuel Thomas,
Hong-Kwang J. Kuo,
Jatin Ganhotra,
Brian Kingsbury,
Eric Fosler-Lussier
Abstract:
Dialog history plays an important role in spoken language understanding (SLU) performance in a dialog system. For end-to-end (E2E) SLU, previous work has used dialog history in text form, which makes the model dependent on a cascaded automatic speech recognizer (ASR). This rescinds the benefits of an E2E system which is intended to be compact and robust to ASR errors. In this paper, we propose a h…
▽ More
Dialog history plays an important role in spoken language understanding (SLU) performance in a dialog system. For end-to-end (E2E) SLU, previous work has used dialog history in text form, which makes the model dependent on a cascaded automatic speech recognizer (ASR). This rescinds the benefits of an E2E system which is intended to be compact and robust to ASR errors. In this paper, we propose a hierarchical conversation model that is capable of directly using dialog history in speech form, making it fully E2E. We also distill semantic knowledge from the available gold conversation transcripts by jointly training a similar text-based conversation model with an explicit tying of acoustic and semantic embeddings. We also propose a novel technique that we call DropFrame to deal with the long training time incurred by adding dialog history in an E2E manner. On the HarperValleyBank dialog dataset, our E2E history integration outperforms a history independent baseline by 7.7% absolute F1 score on the task of dialog action recognition. Our model performs competitively with the state-of-the-art history based cascaded baseline, but uses 48% fewer parameters. In the absence of gold transcripts to fine-tune an ASR model, our model outperforms this baseline by a significant margin of 10% absolute F1 score.
△ Less
Submitted 11 April, 2022;
originally announced April 2022.
-
Hallucination of speech recognition errors with sequence to sequence learning
Authors:
Prashant Serai,
Vishal Sunder,
Eric Fosler-Lussier
Abstract:
Automatic Speech Recognition (ASR) is an imperfect process that results in certain mismatches in ASR output text when compared to plain written text or transcriptions. When plain text data is to be used to train systems for spoken language understanding or ASR, a proven strategy to reduce said mismatch and prevent degradations, is to hallucinate what the ASR outputs would be given a gold transcrip…
▽ More
Automatic Speech Recognition (ASR) is an imperfect process that results in certain mismatches in ASR output text when compared to plain written text or transcriptions. When plain text data is to be used to train systems for spoken language understanding or ASR, a proven strategy to reduce said mismatch and prevent degradations, is to hallucinate what the ASR outputs would be given a gold transcription. Prior work in this domain has focused on modeling errors at the phonetic level, while using a lexicon to convert the phones to words, usually accompanied by an FST Language model. We present novel end-to-end models to directly predict hallucinated ASR word sequence outputs, conditioning on an input word sequence as well as a corresponding phoneme sequence. This improves prior published results for recall of errors from an in-domain ASR system's transcription of unseen data, as well as an out-of-domain ASR system's transcriptions of audio from an unrelated task, while additionally exploring an in-between scenario when limited characterization data from the test ASR system is obtainable. To verify the extrinsic validity of the method, we also use our hallucinated ASR errors to augment training for a spoken question classifier, finding that they enable robustness to real ASR errors in a downstream task, when scarce or even zero task-specific audio was available at train-time.
△ Less
Submitted 31 March, 2021; v1 submitted 22 March, 2021;
originally announced March 2021.
-
Handling Class Imbalance in Low-Resource Dialogue Systems by Combining Few-Shot Classification and Interpolation
Authors:
Vishal Sunder,
Eric Fosler-Lussier
Abstract:
Utterance classification performance in low-resource dialogue systems is constrained by an inevitably high degree of data imbalance in class labels. We present a new end-to-end pairwise learning framework that is designed specifically to tackle this phenomenon by inducing a few-shot classification capability in the utterance representations and augmenting data through an interpolation of utterance…
▽ More
Utterance classification performance in low-resource dialogue systems is constrained by an inevitably high degree of data imbalance in class labels. We present a new end-to-end pairwise learning framework that is designed specifically to tackle this phenomenon by inducing a few-shot classification capability in the utterance representations and augmenting data through an interpolation of utterance representations. Our approach is a general purpose training methodology, agnostic to the neural architecture used for encoding utterances. We show significant improvements in macro-F1 score over standard cross-entropy training for three different neural architectures, demonstrating improvements on a Virtual Patient dialogue dataset as well as a low-resourced emulation of the Switchboard dialogue act classification dataset.
△ Less
Submitted 28 October, 2020;
originally announced October 2020.
-
One-shot Information Extraction from Document Images using Neuro-Deductive Program Synthesis
Authors:
Vishal Sunder,
Ashwin Srinivasan,
Lovekesh Vig,
Gautam Shroff,
Rohit Rahul
Abstract:
Our interest in this paper is in meeting a rapidly growing industrial demand for information extraction from images of documents such as invoices, bills, receipts etc. In practice users are able to provide a very small number of example images labeled with the information that needs to be extracted. We adopt a novel two-level neuro-deductive, approach where (a) we use pre-trained deep neural netwo…
▽ More
Our interest in this paper is in meeting a rapidly growing industrial demand for information extraction from images of documents such as invoices, bills, receipts etc. In practice users are able to provide a very small number of example images labeled with the information that needs to be extracted. We adopt a novel two-level neuro-deductive, approach where (a) we use pre-trained deep neural networks to populate a relational database with facts about each document-image; and (b) we use a form of deductive reasoning, related to meta-interpretive learning of transition systems to learn extraction programs: Given task-specific transitions defined using the entities and relations identified by the neural detectors and a small number of instances (usually 1, sometimes 2) of images and the desired outputs, a resource-bounded meta-interpreter constructs proofs for the instance(s) via logical deduction; a set of logic programs that extract each desired entity is easily synthesized from such proofs. In most cases a single training example together with a noisy-clone of itself suffices to learn a program-set that generalizes well on test documents, at which time the value of each entity is determined by a majority vote across its program-set. We demonstrate our two-level neuro-deductive approach on publicly available datasets ("Patent" and "Doctor's Bills") and also describe its use in a real-life industrial problem.
△ Less
Submitted 6 June, 2019;
originally announced June 2019.
-
Prosocial or Selfish? Agents with different behaviors for Contract Negotiation using Reinforcement Learning
Authors:
Vishal Sunder,
Lovekesh Vig,
Arnab Chatterjee,
Gautam Shroff
Abstract:
We present an effective technique for training deep learning agents capable of negotiating on a set of clauses in a contract agreement using a simple communication protocol. We use Multi Agent Reinforcement Learning to train both agents simultaneously as they negotiate with each other in the training environment. We also model selfish and prosocial behavior to varying degrees in these agents. Empi…
▽ More
We present an effective technique for training deep learning agents capable of negotiating on a set of clauses in a contract agreement using a simple communication protocol. We use Multi Agent Reinforcement Learning to train both agents simultaneously as they negotiate with each other in the training environment. We also model selfish and prosocial behavior to varying degrees in these agents. Empirical evidence is provided showing consistency in agent behaviors. We further train a meta agent with a mixture of behaviors by learning an ensemble of different models using reinforcement learning. Finally, to ascertain the deployability of the negotiating agents, we conducted experiments pitting the trained agents against human players. Results demonstrate that the agents are able to hold their own against human players, often emerging as winners in the negotiation. Our experiments demonstrate that the meta agent is able to reasonably emulate human behavior.
△ Less
Submitted 19 September, 2018;
originally announced September 2018.
-
CIKM AnalytiCup 2017 Lazada Product Title Quality Challenge An Ensemble of Deep and Shallow Learning to predict the Quality of Product Titles
Authors:
Karamjit Singh,
Vishal Sunder
Abstract:
We present an approach where two different models (Deep and Shallow) are trained separately on the data and a weighted average of the outputs is taken as the final result. For the Deep approach, we use different combinations of models like Convolution Neural Network, pretrained word2vec embeddings and LSTMs to get representations which are then used to train a Deep Neural Network. For Clarity pred…
▽ More
We present an approach where two different models (Deep and Shallow) are trained separately on the data and a weighted average of the outputs is taken as the final result. For the Deep approach, we use different combinations of models like Convolution Neural Network, pretrained word2vec embeddings and LSTMs to get representations which are then used to train a Deep Neural Network. For Clarity prediction, we also use an Attentive Pooling approach for the pooling operation so as to be aware of the Title-Category pair. For the shallow approach, we use boosting technique LightGBM on features generated using title and categories. We find that an ensemble of these approaches does a better job than using them alone suggesting that the results of the deep and shallow approach are highly complementary
△ Less
Submitted 1 April, 2018;
originally announced April 2018.