
Showing 1–10 of 10 results for author: Sridhar, P

Searching in archive eess.
  1. arXiv:2406.09345  [pdf, other]

    cs.CL cs.SD eess.AS

    DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding

    Authors: Suwon Shon, Kwangyoun Kim, Yi-Te Hsu, Prashant Sridhar, Shinji Watanabe, Karen Livescu

    Abstract: The integration of pre-trained text-based large language models (LLM) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires the use of a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose the use of discrete speech units (DSU), rather than continuous-valued speech encoder outputs, that are converted to t…

    Submitted 13 June, 2024; originally announced June 2024.

  2. arXiv:2401.08835  [pdf, other]

    cs.CL eess.AS

    Improving ASR Contextual Biasing with Guided Attention

    Authors: Jiyang Tang, Kwangyoun Kim, Suwon Shon, Felix Wu, Prashant Sridhar, Shinji Watanabe

    Abstract: In this paper, we propose a Guided Attention (GA) auxiliary training loss, which improves the effectiveness and robustness of automatic speech recognition (ASR) contextual biasing without introducing additional parameters. A common challenge in previous literature is that the word error rate (WER) reduction brought by contextual biasing diminishes as the number of bias phrases increases. To addres…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  3. arXiv:2312.09895  [pdf, other]

    cs.CL cs.SD eess.AS

    Generative Context-aware Fine-tuning of Self-supervised Speech Models

    Authors: Suwon Shon, Kwangyoun Kim, Prashant Sridhar, Yi-Te Hsu, Shinji Watanabe, Karen Livescu

    Abstract: When performing tasks like automatic speech recognition or spoken language understanding for a given utterance, access to preceding text or audio provides contextual information that can improve performance. Considering the recent advances in generative large language models (LLM), we hypothesize that an LLM could generate useful context information using the preceding text. With appropriate prompts, L…

    Submitted 15 December, 2023; originally announced December 2023.

  4. arXiv:2305.11073  [pdf, other]

    cs.CL cs.SD eess.AS

    A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks

    Authors: Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe

    Abstract: Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing due to its superior performance in various tasks, including automatic speech recognition (ASR), speech translation (ST) and spoken language understanding (SLU). Recently, a new encoder called E-Branchformer has outperformed Conformer in the LibriSpeech ASR benchmark, making it…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted at INTERSPEECH 2023. Code: https://github.com/espnet/espnet

  5. arXiv:2302.14132  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding

    Authors: Yifan Peng, Kwangyoun Kim, Felix Wu, Prashant Sridhar, Shinji Watanabe

    Abstract: Self-supervised speech representation learning (SSL) has been shown to be effective in various downstream tasks, but SSL models are usually large and slow. Model compression techniques such as pruning aim to reduce the model size and computation without degradation in accuracy. Prior studies focus on the pruning of Transformers; however, speech models not only utilize a stack of Transformer blocks, but…

    Submitted 27 February, 2023; originally announced February 2023.

    Comments: Accepted at ICASSP 2023

  6. arXiv:2212.08542  [pdf, other]

    eess.AS cs.CL

    Context-aware Fine-tuning of Self-supervised Speech Models

    Authors: Suwon Shon, Felix Wu, Kwangyoun Kim, Prashant Sridhar, Karen Livescu, Shinji Watanabe

    Abstract: Self-supervised pre-trained transformers have improved the state of the art on a variety of speech tasks. Due to the quadratic time and space complexity of self-attention, they usually operate at the level of relatively short (e.g., utterance) segments. In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tu…

    Submitted 28 March, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

  7. arXiv:2210.00077  [pdf, other]

    eess.AS cs.LG

    E-Branchformer: Branchformer with Enhanced merging for speech recognition

    Authors: Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

    Abstract: Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention but they have not managed to match Conformer's performance. The recently introduced Bra…

    Submitted 14 October, 2022; v1 submitted 30 September, 2022; originally announced October 2022.

    Comments: Accepted to SLT 2022

  8. arXiv:2106.09760  [pdf, other]

    eess.AS cs.CL cs.SD

    Multi-mode Transformer Transducer with Stochastic Future Context

    Authors: Kwangyoun Kim, Felix Wu, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

    Abstract: Automatic speech recognition (ASR) models make fewer errors when more surrounding speech information is presented as context. Unfortunately, acquiring a larger future context leads to higher latency. There exists an inevitable trade-off between speed and accuracy. Naively, to fit different latency requirements, people have to store multiple models and pick the best one under the constraints. Inste…

    Submitted 17 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021

  9. arXiv:1811.12290  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Tuplemax Loss for Language Identification

    Authors: Li Wan, Prashant Sridhar, Yang Yu, Quan Wang, Ignacio Lopez Moreno

    Abstract: In many scenarios of a language identification task, the user will specify a small set of languages which he/she can speak instead of a large set of all possible languages. We want to model such prior knowledge into the way we train our neural networks, by replacing the commonly used softmax loss function with a novel loss function named tuplemax loss. As a matter of fact, a typical language ident…

    Submitted 17 February, 2019; v1 submitted 29 November, 2018; originally announced November 2018.

    Comments: Submitted to ICASSP 2019

  10. arXiv:1810.04826  [pdf, other]

    eess.AS cs.LG eess.SP stat.ML

    VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

    Authors: Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A. Saurous, Ron J. Weiss, Ye Jia, Ignacio Lopez Moreno

    Abstract: In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embe…

    Submitted 19 June, 2019; v1 submitted 10 October, 2018; originally announced October 2018.

    Comments: To appear in Interspeech 2019