Showing 1–26 of 26 results for author: Mahadeokar, J

  1. arXiv:2404.01716  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG

    Effective internal language model training and fusion for factorized transducer model

    Authors: Jinxi Guo, Niko Moritz, Yingyi Ma, Frank Seide, Chunyang Wu, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

    Abstract: The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is mainly used for estimating the ILM score and is subsequently subtracted during inference to facilitate improved integration with external language models. Recently, various factorized transducer models have been proposed, which explicitly embrace a standalone internal language model for…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: Accepted to ICASSP 2024

  2. arXiv:2309.13018  [pdf, other]

    eess.AS cs.CL cs.SD

    Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

    Authors: Jiamin Xie, Ke Li, Jinxi Guo, Andros Tjandra, Yuan Shangguan, Leda Sari, Chunyang Wu, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli

    Abstract: Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training needed to be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in…

    Submitted 11 January, 2024; v1 submitted 22 September, 2023; originally announced September 2023.

  3. arXiv:2309.01947  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models

    Authors: Yuan Shangguan, Haichuan Yang, Danni Li, Chunyang Wu, Yassir Fathullah, Dilin Wang, Ayushi Dalmia, Raghuraman Krishnamoorthi, Ozlem Kalinli, Junteng Jia, Jay Mahadeokar, Xin Lei, Mike Seltzer, Vikas Chandra

    Abstract: Automatic Speech Recognition (ASR) models need to be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model's hyperparameters or exploring variations in its architecture. Re-training and re-validating models after making these changes can be a resource-intensive task. This paper presents TODM (Train Once Deploy Many), a new approach to efficien…

    Submitted 27 November, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: Meta AI; Submitted to ICASSP 2024

  4. arXiv:2307.11795  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG

    Prompting Large Language Models with Speech Recognition Abilities

    Authors: Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

    Abstract: Large language models have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder allowing it to perform speech recognition. By directly prepending a sequence of audial embeddings to the text token embeddings,…

    Submitted 21 July, 2023; originally announced July 2023.

  5. arXiv:2306.15687  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

    Authors: Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu

    Abstract: Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative…

    Submitted 19 October, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

    Comments: Accepted to NeurIPS 2023

  6. arXiv:2306.00998  [pdf, other]

    eess.AS cs.CL cs.SD

    Towards Selection of Text-to-speech Data to Augment ASR Training

    Authors: Shuo Liu, Leda Sarı, Chunyang Wu, Gil Keren, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli

    Abstract: This paper presents a method for selecting appropriate synthetic speech samples from a given large text-to-speech (TTS) dataset as supplementary training data for an automatic speech recognition (ASR) model. We trained a neural network, which can be optimised using cross-entropy loss or Arcface loss, to measure the similarity of synthetic data to real speech. We found that incorporating syntheti…

    Submitted 30 May, 2023; originally announced June 2023.

  7. arXiv:2305.12498  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Multi-Head State Space Model for Speech Recognition

    Authors: Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, Mark J. F. Gales

    Abstract: State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in…

    Submitted 25 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  8. arXiv:2212.07650  [pdf, other]

    eess.AS

    Improving Fast-slow Encoder based Transducer with Streaming Deliberation

    Authors: Ke Li, Jay Mahadeokar, Jinxi Guo, Yangyang Shi, Gil Keren, Ozlem Kalinli, Michael L. Seltzer, Duc Le

    Abstract: This paper introduces a fast-slow encoder based transducer with streaming deliberation for end-to-end automatic speech recognition. We aim to improve the recognition accuracy of the fast-slow encoder based transducer while keeping its latency low by integrating a streaming deliberation model. Specifically, the deliberation model leverages partial hypotheses from the streaming fast encoder and impl…

    Submitted 15 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  9. arXiv:2210.14252  [pdf, other]

    cs.SD eess.AS

    Dynamic Speech Endpoint Detection with Regression Targets

    Authors: Dawei Liang, Hang Su, Tarun Singh, Jay Mahadeokar, Shanil Puri, Jiedan Zhu, Edison Thomaz, Mike Seltzer

    Abstract: Interactive voice assistants have been widely used as input interfaces in various scenarios, e.g. on smart home devices, wearables and on AR devices. Detecting the end of a speech query, i.e. speech end-pointing, is an important task for voice assistants to interact with users. Traditionally, speech end-pointing is based on pure classification methods along with arbitrary binary targets. In this…

    Submitted 25 October, 2022; originally announced October 2022.

    Comments: Manuscript submitted to ICASSP 2023

  10. arXiv:2210.11588  [pdf, other]

    eess.AS cs.SD

    Anchored Speech Recognition with Neural Transducers

    Authors: Desh Raj, Junteng Jia, Jay Mahadeokar, Chunyang Wu, Niko Moritz, Xiaohui Zhang, Ozlem Kalinli

    Abstract: Neural transducers have achieved human level performance on standard speech recognition benchmarks. However, their performance significantly degrades in the presence of cross-talk, especially when the primary speaker has a low signal-to-noise ratio. Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., wake-words) to recognize device-directed s…

    Submitted 29 March, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

    Comments: To appear at IEEE ICASSP 2023

  11. arXiv:2204.08858  [pdf, other]

    eess.AS cs.SD

    An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

    Authors: Niko Moritz, Frank Seide, Duc Le, Jay Mahadeokar, Christian Fuegen

    Abstract: The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are RNN-Transducer (RNN-T) and connectionist temporal classification (CTC). Between these two loss types we can classify the monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T). Monotonic transducers have a few advantages. First, RNN-T can suffer from runaway hallucination,…

    Submitted 21 October, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

    Comments: Accepted to SLT 2022

  12. arXiv:2203.15966  [pdf, other]

    cs.SD cs.CL eess.AS

    Federated Domain Adaptation for ASR with Full Self-Supervision

    Authors: Junteng Jia, Jay Mahadeokar, Weiyi Zheng, Yuan Shangguan, Ozlem Kalinli, Frank Seide

    Abstract: Cross-device federated learning (FL) protects user privacy by collaboratively training a model on user devices, therefore eliminating the need for collecting, storing, and manually labeling user data. While important topics such as the FL training algorithm, non-IID-ness, and Differential Privacy have been well studied in the literature, this paper focuses on two challenges of practical importance…

    Submitted 5 April, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

  13. arXiv:2203.15773  [pdf, other]

    cs.CL cs.SD eess.AS

    Streaming parallel transducer beam search with fast-slow cascaded encoders

    Authors: Jay Mahadeokar, Yangyang Shi, Ke Li, Duc Le, Jiedan Zhu, Vikas Chandra, Ozlem Kalinli, Michael L Seltzer

    Abstract: Streaming ASR with strict latency constraints is required in many speech recognition applications. In order to achieve the required latency, streaming ASR models sacrifice accuracy compared to non-streaming ASR models due to lack of future input context. Previous research has shown that streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders.…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: 5 pages, 2 figures, Interspeech 2022 submission

  14. arXiv:2110.15018  [pdf, other]

    eess.AS cs.SD

    TorchAudio: Building Blocks for Audio and Speech Processing

    Authors: Yao-Yuan Yang, Moto Hira, Zhaoheng Ni, Anjali Chourdia, Artyom Astafurov, Caroline Chen, Ching-Feng Yeh, Christian Puhrsch, David Pollack, Dmitriy Genzel, Donny Greenberg, Edward Z. Yang, Jason Lian, Jay Mahadeokar, Jeff Hwang, Ji Chen, Peter Goldsborough, Prabhat Roy, Sean Narenthiran, Shinji Watanabe, Soumith Chintala, Vincent Quenneville-Bélair, Yangyang Shi

    Abstract: This document describes version 0.10 of TorchAudio: building blocks for machine learning applications in the audio and speech processing domain. The objective of TorchAudio is to accelerate the development and deployment of machine learning applications for researchers and engineers by providing off-the-shelf building blocks. The building blocks are designed to be GPU-compatible, automatically dif…

    Submitted 16 February, 2022; v1 submitted 28 October, 2021; originally announced October 2021.

    Comments: Accepted by ICASSP 2022

  15. arXiv:2110.05241  [pdf, other]

    eess.AS cs.CL cs.LG

    Streaming Transformer Transducer Based Speech Recognition Using Non-Causal Convolution

    Authors: Yangyang Shi, Chunyang Wu, Dilin Wang, Alex Xiao, Jay Mahadeokar, Xiaohui Zhang, Chunxi Liu, Ke Li, Yuan Shangguan, Varun Nagaraja, Ozlem Kalinli, Mike Seltzer

    Abstract: This paper improves the streaming transformer transducer for speech recognition by using non-causal convolution. Many works apply the causal convolution to improve streaming transformer ignoring the lookahead context. We propose to use non-causal convolution to process the center block and lookahead context separately. This method leverages the lookahead context in convolution and maintains simila…

    Submitted 7 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, submit to ICASSP 2022

  16. arXiv:2104.02232  [pdf, other]

    cs.SD cs.CL eess.AS

    Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios

    Authors: Jay Mahadeokar, Yangyang Shi, Yuan Shangguan, Chunyang Wu, Alex Xiao, Hang Su, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

    Abstract: Often, the storage and computational constraints of embedded devices demand that a single on-device ASR model serve multiple use-cases / domains. In this paper, we propose a Flexible Transducer (FlexiT) for on-device automatic speech recognition to flexibly deal with multiple use-cases / domains with different accuracy and latency requirements. Specifically, using a single compact model, FlexiT provid…

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: Submitted to Interspeech 2021 (under review)

  17. arXiv:2104.02207  [pdf, other]

    cs.SD cs.CL eess.AS

    Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

    Authors: Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

    Abstract: As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that can run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task. Apart from being accurate…

    Submitted 11 August, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: Proc. of Interspeech 2021

  18. arXiv:2104.02194  [pdf, other]

    cs.CL cs.LG eess.AS

    Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion

    Authors: Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer

    Abstract: How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area. Previous solutions to this problem were either designed for specialized use cases that did not generalize well to open-domain scenarios, did not scale to large biasing lists, or underperformed on rare long-tail words. We address these limitations by proposing a novel solution that…

    Submitted 11 June, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: Accepted for presentation at INTERSPEECH 2021

  19. arXiv:2102.11531  [pdf, other]

    cs.SD cs.CL eess.AS

    Memory-efficient Speech Recognition on Smart Devices

    Authors: Ganesh Venkatesh, Alagappan Valliappan, Jay Mahadeokar, Yuan Shangguan, Christian Fuegen, Michael L. Seltzer, Vikas Chandra

    Abstract: Recurrent transducer models have emerged as a promising solution for speech recognition on the current and next generation smart devices. The transducer models provide competitive accuracy within a reasonable memory footprint alleviating the memory capacity constraints in these devices. However, these models access parameters from off-chip memory for every input time step which adversely affects d…

    Submitted 23 February, 2021; originally announced February 2021.

    Journal ref: ICASSP 2021

  20. arXiv:2011.07754  [pdf, other]

    cs.CL eess.AS

    Deep Shallow Fusion for RNN-T Personalization

    Authors: Duc Le, Gil Keren, Julian Chan, Jay Mahadeokar, Christian Fuegen, Michael L. Seltzer

    Abstract: End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T) in particular, have gained significant traction in the automatic speech recognition community in the last few years due to their simplicity, compactness, and excellent performance on generic transcription tasks. However, these models are more challenging to personalize compared to traditional hybrid systems due to the la…

    Submitted 16 November, 2020; originally announced November 2020.

    Comments: To appear at SLT 2021

  21. arXiv:2011.03072  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Alignment Restricted Streaming Recurrent Neural Network Transducer

    Authors: Jay Mahadeokar, Yuan Shangguan, Duc Le, Gil Keren, Hang Su, Thong Le, Ching-Feng Yeh, Christian Fuegen, Michael L. Seltzer

    Abstract: There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal alignment of the training transcripts and audio. As a result, RNN-T models built with uni-directional long short term memory (LSTM) encoders tend to wait for lon…

    Submitted 5 November, 2020; originally announced November 2020.

    Comments: Accepted for presentation at IEEE Spoken Language Technology Workshop (SLT) 2021

  22. arXiv:2006.03411  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Contextual RNN-T For Open Domain ASR

    Authors: Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf

    Abstract: End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS) blend the individual components of a traditional hybrid ASR system - acoustic model, language model, pronunciation model - into a single neural network. While this has some nice advantages, it limits the system to be trained using only paired audio and text. Because of this…

    Submitted 12 August, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

  23. arXiv:1911.02115  [pdf, ps, other]

    eess.AS cs.SD

    Spatial Attention for Far-field Speech Recognition with Deep Beamforming Neural Networks

    Authors: Weipeng He, Lu Lu, Biqiao Zhang, Jay Mahadeokar, Kaustubh Kalgaonkar, Christian Fuegen

    Abstract: In this paper, we introduce spatial attention for refining the information in multi-direction neural beamformer for far-field automatic speech recognition. Previous approaches of neural beamformers with multiple look directions, such as the factored complex linear projection, have shown promising results. However, the features extracted by such methods contain redundant information, as only the di…

    Submitted 9 March, 2020; v1 submitted 5 November, 2019; originally announced November 2019.

    Comments: To be presented at ICASSP 2020

  24. arXiv:1911.01629  [pdf, other]

    cs.CL cs.LG eess.AS

    RNN-T For Latency Controlled ASR With Improved Beam Search

    Authors: Mahaveer Jain, Kjell Schubert, Jay Mahadeokar, Ching-Feng Yeh, Kaustubh Kalgaonkar, Anuroop Sriram, Christian Fuegen, Michael L. Seltzer

    Abstract: Neural transducer-based systems such as RNN Transducers (RNN-T) for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR system (acoustic model, language model, punctuation model, inverse text normalization) into one single model. This greatly simplifies training and inference and hence makes RNN-T a desirable choice for ASR systems. In this work, we inve…

    Submitted 16 January, 2020; v1 submitted 5 November, 2019; originally announced November 2019.

  25. arXiv:1910.12977  [pdf, other]

    eess.AS cs.CL cs.SD

    Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

    Authors: Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer

    Abstract: We explore options to use Transformer networks in neural transducer for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and come with advantages in parallel computation and capturing contexts. We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce frame rate for efficient inference 2) using truncated self-at…

    Submitted 28 October, 2019; originally announced October 2019.

  26. Transformer-based Acoustic Modeling for Hybrid Speech Recognition

    Authors: Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, Christian Fuegen, Geoffrey Zweig, Michael L. Seltzer

    Abstract: We propose and evaluate transformer-based acoustic models (AMs) for hybrid speech recognition. Several modeling choices are discussed in this work, including various positional embedding methods and an iterated loss to enable training deep transformers. We also present a preliminary study of using limited right context in transformer models, which makes it possible for streaming applications. We d…

    Submitted 29 April, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: to appear in ICASSP 2020