Showing 1–11 of 11 results for author: Dhir, C

  1. arXiv:2305.09681  [pdf, other]

    eess.AS cs.SD

    Continual Learning for End-to-End ASR by Averaging Domain Experts

    Authors: Peter Plantinga, Jaekwon Yoo, Chandra Dhir

    Abstract: Continual learning for end-to-end automatic speech recognition has to contend with a number of difficulties. Fine-tuning strategies tend to lose performance on data already seen, a process known as catastrophic forgetting. On the other hand, strategies that freeze parameters and append tunable parameters must maintain multiple models. We suggest a strategy that maintains only a single model for in…

    Submitted 12 May, 2023; originally announced May 2023.

    Comments: Submitted to INTERSPEECH 2023

  2. arXiv:2204.02455  [pdf, other]

    cs.SD cs.LG eess.AS

    Improving Voice Trigger Detection with Metric Learning

    Authors: Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

    Abstract: Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice trigger detection task. However, such a speaker-independent voice trigger detector typically suffers from performance degradation on speech from underrepresented…

    Submitted 13 September, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: Accepted at InterSpeech 2022

  3. arXiv:2203.15975  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

    Authors: Vineet Garg, Ognjen Rudovic, Pranay Dighe, Ahmed H. Abdelaziz, Erik Marchi, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

    Abstract: We address the problem of detecting speech directed to a device that does not contain a specific wake-word. Specifically, we focus on audio coming from a touch-based invocation. Mitigating virtual assistant (VA) activations due to accidental button presses is critical for user experience. While the majority of approaches to false trigger mitigation (FTM) are designed to detect the presence of a t…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022

  4. arXiv:2107.07634  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Multi-task Learning with Cross Attention for Keyword Spotting

    Authors: Takuya Higuchi, Anmol Gupta, Chandra Dhir

    Abstract: Keyword spotting (KWS) is an important technique for speech applications, which enables users to activate devices by speaking a keyword phrase. Although a phoneme classifier can be used for KWS, exploiting a large amount of transcribed data for automatic speech recognition (ASR), there is a mismatch between the training criterion (phoneme recognition) and the target task (KWS). Recently, multi-tas…

    Submitted 22 September, 2021; v1 submitted 15 July, 2021; originally announced July 2021.

    Comments: Accepted at ASRU 2021

  5. arXiv:2105.06598  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation

    Authors: Vineet Garg, Wonil Chang, Siddharth Sigtia, Saurabh Adya, Pramod Simha, Pranay Dighe, Chandra Dhir

    Abstract: We present a unified and hardware-efficient architecture for two-stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks. Two-stage VTD systems of voice assistants can get falsely activated by audio segments acoustically similar to the trigger phrase of interest. FTM systems cancel such activations by using post-trigger audio context. Traditional FTM systems rely on automatic…

    Submitted 13 May, 2021; originally announced May 2021.

  6. arXiv:2102.09666  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    Dynamic curriculum learning via data parameters for noise robust keyword spotting

    Authors: Takuya Higuchi, Shreyas Saxena, Mehrez Souden, Tien Dung Tran, Masood Delfarah, Chandra Dhir

    Abstract: We propose dynamic curriculum learning via data parameters for noise robust keyword spotting. Data parameter learning has recently been introduced for image processing, where weight parameters, so-called data parameters, for target classes and instances are introduced and optimized along with model parameters. The data parameters scale logits and control importance over classes and instances durin…

    Submitted 18 February, 2021; originally announced February 2021.

    Comments: Accepted at ICASSP 2021

  7. arXiv:2011.01151  [pdf, other]

    cs.SD cs.LG eess.AS

    Optimize what matters: Training DNN-HMM Keyword Spotting Model Using End Metric

    Authors: Ashish Shrivastava, Arnav Kundu, Chandra Dhir, Devang Naik, Oncel Tuzel

    Abstract: Deep Neural Network–Hidden Markov Model (DNN-HMM) based methods have been successfully used for many always-on keyword spotting algorithms that detect a wake word to trigger a device. The DNN predicts the state probabilities of a given speech frame, while the HMM decoder combines the DNN predictions of multiple speech frames to compute the keyword detection score. The DNN, in prior methods, is traine…

    Submitted 25 February, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: Accepted at ICASSP 2021

  8. arXiv:2008.03405  [pdf, other]

    eess.AS cs.SD

    Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection

    Authors: Takuya Higuchi, Mohammad Ghasemzadeh, Kisun You, Chandra Dhir

    Abstract: We propose a stacked 1D convolutional neural network (S1DCNN) for end-to-end small footprint voice trigger detection in a streaming scenario. Voice trigger detection is an important speech application, with which users can activate their devices by simply saying a keyword or phrase. Due to privacy and latency reasons, a voice trigger detection system should run on an always-on processor on device.…

    Submitted 7 August, 2020; originally announced August 2020.

    Comments: Accepted to INTERSPEECH 2020

  9. arXiv:2008.02323  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering

    Authors: Saurabh Adya, Vineet Garg, Siddharth Sigtia, Pramod Simha, Chandra Dhir

    Abstract: We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass that are used to re-score candidate segments obtained from the first pass. Our baseline is an acoustic model (AM), with BiLSTM layers, trained by minimizing the CTC loss. We replace the BiLSTM layers with self-attention layers. Results on internal evaluation sets show that self-attention…

    Submitted 5 August, 2020; originally announced August 2020.

    Comments: INTERSPEECH, 2020

  10. arXiv:2003.06227  [pdf, other]

    eess.AS cs.CV cs.IT cs.LG cs.SD

    Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis

    Authors: Ting-Yao Hu, Ashish Shrivastava, Oncel Tuzel, Chandra Dhir

    Abstract: We present a method to generate speech from input text and a style vector that is extracted from a reference speech signal in an unsupervised manner, i.e., no style annotation, such as speaker information, is required. Existing unsupervised methods, during training, generate speech by computing style from the corresponding ground truth sample and use a decoder to combine the style vector with the…

    Submitted 9 March, 2020; originally announced March 2020.

    Comments: Accepted at ICASSP 2020 (for presentation in a lecture session)

  11. arXiv:1509.03475  [pdf, ps, other]

    cs.LG cs.NE stat.ML

    Hessian-free Optimization for Learning Deep Multidimensional Recurrent Neural Networks

    Authors: Minhyung Cho, Chandra Shekhar Dhir, Jaehyung Lee

    Abstract: Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of…

    Submitted 23 October, 2015; v1 submitted 11 September, 2015; originally announced September 2015.

    Comments: to appear at NIPS 2015