Skip to main content

Showing 1–9 of 9 results for author: Adya, S

Searching in archive eess. Search in all archives.
.
  1. arXiv:2406.09617  [pdf, other

    cs.CL cs.HC eess.AS

    Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

    Authors: Shruti Palaskar, Oggi Rudovic, Sameer Dharur, Florian Pesce, Gautam Krishna, Aswin Sivaraman, Jack Berkowitz, Ahmed Hussen Abdelaziz, Saurabh Adya, Ahmed Tewfik

    Abstract: Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  2. arXiv:2406.09443  [pdf, other

    eess.AS cs.HC cs.LG

    Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness

    Authors: Satyam Kumar, Sai Srujana Buddi, Utkarsh Oggy Sarawgi, Vineet Garg, Shivesh Ranjan, Ognjen, Rudovic, Ahmed Hussen Abdelaziz, Saurabh Adya

    Abstract: Voice activity detection (VAD) is a critical component in various applications such as speech recognition, speech enhancement, and hands-free communication systems. With the increasing demand for personalized and context-aware technologies, the need for effective personalized VAD systems has become paramount. In this paper, we present a comparative analysis of Personalized Voice Activity Detection… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

  3. arXiv:2310.15261  [pdf, ps, other

    cs.SD cs.HC cs.LG eess.AS

    Modality Dropout for Multimodal Device Directed Speech Detection using Verbal and Non-Verbal Features

    Authors: Gautam Krishna, Sameer Dharur, Oggi Rudovic, Pranay Dighe, Saurabh Adya, Ahmed Hussen Abdelaziz, Ahmed H Tewfik

    Abstract: Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant versus side conversation or background speech. State-of-the-art DDSD systems use verbal cues, e.g acoustic, text and/or automatic speech recognition system (ASR) features, to classify speech as device-directed or otherwise, and often have to contend with one or… ▽ More

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: 5 pages

  4. arXiv:2303.08253  [pdf, other

    cs.LG cs.CV cs.PF eess.IV

    R2 Loss: Range Restriction Loss for Model Compression and Quantization

    Authors: Arnav Kundu, Chungkuk Yoo, Srijan Mishra, Minsik Cho, Saurabh Adya

    Abstract: Model quantization and compression is widely used techniques to reduce usage of computing resource at inference time. While state-of-the-art works have been achieved reasonable accuracy with higher bit such as 4bit or 8bit, but still it is challenging to quantize/compress a model further, e.g., 1bit or 2bit. To overcome the challenge, we focus on outliers in weights of a pre-trained model which di… ▽ More

    Submitted 11 February, 2024; v1 submitted 14 March, 2023; originally announced March 2023.

  5. arXiv:2204.02455  [pdf, other

    cs.SD cs.LG eess.AS

    Improving Voice Trigger Detection with Metric Learning

    Authors: Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

    Abstract: Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice trigger detection task. However, such a speaker independent voice trigger detector typically suffers from performance degradation on speech from underrepresented… ▽ More

    Submitted 13 September, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: Accepted at InterSpeech 2022

  6. arXiv:2203.15975  [pdf, other

    eess.AS cs.HC cs.LG cs.SD

    Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

    Authors: Vineet Garg, Ognjen Rudovic, Pranay Dighe, Ahmed H. Abdelaziz, Erik Marchi, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

    Abstract: We address the problem of detecting speech directed to a device that does not contain a specific wake-word. Specifically, we focus on audio coming from a touch-based invocation. Mitigating virtual assistants (VAs) activation due to accidental button presses is critical for user experience. While the majority of approaches to false trigger mitigation (FTM) are designed to detect the presence of a t… ▽ More

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022

  7. arXiv:2105.06598  [pdf, other

    eess.AS cs.HC cs.LG cs.SD

    Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation

    Authors: Vineet Garg, Wonil Chang, Siddharth Sigtia, Saurabh Adya, Pramod Simha, Pranay Dighe, Chandra Dhir

    Abstract: We present a unified and hardware efficient architecture for two stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks. Two stage VTD systems of voice assistants can get falsely activated to audio segments acoustically similar to the trigger phrase of interest. FTM systems cancel such activations by using post trigger audio context. Traditional FTM systems rely on automatic… ▽ More

    Submitted 13 May, 2021; originally announced May 2021.

  8. arXiv:2008.02323  [pdf, other

    eess.AS cs.HC cs.LG cs.SD

    Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering

    Authors: Saurabh Adya, Vineet Garg, Siddharth Sigtia, Pramod Simha, Chandra Dhir

    Abstract: We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass that are used to re-score candidate segments obtained from the first-pass. Our baseline is an acoustic model(AM), with BiLSTM layers, trained by minimizing the CTC loss. We replace the BiLSTM layers with self-attention layers. Results on internal evaluation sets show that self-attention… ▽ More

    Submitted 5 August, 2020; originally announced August 2020.

    Comments: INTERSPEECH, 2020

  9. arXiv:2001.10822  [pdf, other

    eess.AS cs.CL cs.LG cs.SD stat.ML

    Lattice-based Improvements for Voice Triggering Using Graph Neural Networks

    Authors: Pranay Dighe, Saurabh Adya, Nuoyu Li, Srikanth Vishnubhotla, Devang Naik, Adithya Sagar, Ying Ma, Stephen Pulman, Jason Williams

    Abstract: Voice-triggered smart assistants often rely on detection of a trigger-phrase before they start listening for the user request. Mitigation of false triggers is an important aspect of building a privacy-centric non-intrusive smart assistant. In this paper, we address the task of false trigger mitigation (FTM) using a novel approach based on analyzing automatic speech recognition (ASR) lattices using… ▽ More

    Submitted 24 January, 2020; originally announced January 2020.