Showing 1–32 of 32 results for author: Kalinli, O

  1. arXiv:2406.18108  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Token-Weighted RNN-T for Learning from Flawed Data

    Authors: Gil Keren, Wei Zhou, Ozlem Kalinli

    Abstract: ASR models are commonly trained with the cross-entropy criterion to increase the probability of a target token sequence. While optimizing the probability of all tokens in the target sequence is sensible, one may want to de-emphasize tokens that reflect transcription errors. In this work, we propose a novel token-weighted RNN-T criterion that augments the RNN-T objective with token-specific weights…

    Submitted 26 June, 2024; originally announced June 2024.

  2. arXiv:2404.01716  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG

    Effective internal language model training and fusion for factorized transducer model

    Authors: Jinxi Guo, Niko Moritz, Yingyi Ma, Frank Seide, Chunyang Wu, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

    Abstract: The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is mainly used for estimating the ILM score and is subsequently subtracted during inference to facilitate improved integration with external language models. Recently, various factorized transducer models have been proposed, which explicitly embrace a standalone internal language model for…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: Accepted to ICASSP 2024

  3. arXiv:2310.11003  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Correction Focused Language Model Training for Speech Recognition

    Authors: Yingyi Ma, Zhe Liu, Ozlem Kalinli

    Abstract: Language models (LMs) have been commonly adopted to boost the performance of automatic speech recognition (ASR), particularly in domain adaptation tasks. The conventional way of LM training treats all the words in corpora equally, resulting in suboptimal improvements in ASR performance. In this work, we introduce a novel correction focused LM training approach which aims to prioritize ASR fallible word…

    Submitted 17 October, 2023; originally announced October 2023.

  4. arXiv:2309.13018  [pdf, other]

    eess.AS cs.CL cs.SD

    Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

    Authors: Jiamin Xie, Ke Li, Jinxi Guo, Andros Tjandra, Yuan Shangguan, Leda Sari, Chunyang Wu, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli

    Abstract: Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training that must be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in…

    Submitted 11 January, 2024; v1 submitted 22 September, 2023; originally announced September 2023.

  5. arXiv:2309.10917  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    End-to-End Speech Recognition Contextualization with Large Language Models

    Authors: Egor Lakomkin, Chunyang Wu, Yassir Fathullah, Ozlem Kalinli, Michael L. Seltzer, Christian Fuegen

    Abstract: In recent years, Large Language Models (LLMs) have garnered significant attention from the research community due to their exceptional performance and generalization capabilities. In this paper, we introduce a novel method for contextualizing speech recognition models incorporating LLMs. Our approach casts speech recognition as a mixed-modal language modeling task based on a pretrained LLM. We pro…

    Submitted 19 September, 2023; originally announced September 2023.

  6. arXiv:2309.09390  [pdf, other]

    cs.CL cs.SD eess.AS

    Augmenting text for spoken language understanding with Large Language Models

    Authors: Roshan Sharma, Suyoun Kim, Daniel Lazar, Trang Le, Akshat Shrivastava, Kwanghoon Ahn, Piyush Kansal, Leda Sari, Ozlem Kalinli, Michael Seltzer

    Abstract: Spoken semantic parsing (SSP) involves generating machine-comprehensible parses from input speech. Training robust models for existing application domains represented in training data or extending to new domains requires corresponding triplets of speech-transcript-semantic parse data, which is expensive to obtain. In this paper, we address this challenge by examining methods that can use transcrip…

    Submitted 17 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  7. arXiv:2309.01947  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models

    Authors: Yuan Shangguan, Haichuan Yang, Danni Li, Chunyang Wu, Yassir Fathullah, Dilin Wang, Ayushi Dalmia, Raghuraman Krishnamoorthi, Ozlem Kalinli, Junteng Jia, Jay Mahadeokar, Xin Lei, Mike Seltzer, Vikas Chandra

    Abstract: Automatic Speech Recognition (ASR) models need to be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model's hyperparameters or exploring variations in its architecture. Re-training and re-validating models after making these changes can be a resource-intensive task. This paper presents TODM (Train Once Deploy Many), a new approach to efficien…

    Submitted 27 November, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: Meta AI; Submitted to ICASSP 2024

  8. arXiv:2309.00723  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Contextual Biasing of Named-Entities with Large Language Models

    Authors: Chuanneng Sun, Zeeshan Ahmed, Yingyi Ma, Zhe Liu, Lucas Kabela, Yutong Pang, Ozlem Kalinli

    Abstract: This paper studies contextual biasing with Large Language Models (LLMs), where during second-pass rescoring additional contextual information is provided to an LLM to boost Automatic Speech Recognition (ASR) performance. We propose to leverage prompts for an LLM without fine-tuning during rescoring which incorporate a biasing list and few-shot examples to serve as additional information when calcula…

    Submitted 21 September, 2023; v1 submitted 1 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures. Conference: ICASSP 2024

    MSC Class: 68T10; ACM Class: I.2.7

  9. arXiv:2307.12134  [pdf, other]

    cs.CL cs.SD eess.AS

    Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding

    Authors: Suyoun Kim, Akshat Shrivastava, Duc Le, Ju Lin, Ozlem Kalinli, Michael L. Seltzer

    Abstract: End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have become more promising recently. This approach uses a single model that utilizes audio and text representations from pre-trained speech recognition models (ASR), and outperforms traditional pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems still show weakness wh…

    Submitted 22 July, 2023; originally announced July 2023.

    Comments: INTERSPEECH 2023

  10. arXiv:2307.11795  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG

    Prompting Large Language Models with Speech Recognition Abilities

    Authors: Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

    Abstract: Large language models have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder allowing it to perform speech recognition. By directly prepending a sequence of audial embeddings to the text token embeddings,…

    Submitted 21 July, 2023; originally announced July 2023.

  11. arXiv:2306.00998  [pdf, other]

    eess.AS cs.CL cs.SD

    Towards Selection of Text-to-speech Data to Augment ASR Training

    Authors: Shuo Liu, Leda Sarı, Chunyang Wu, Gil Keren, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli

    Abstract: This paper presents a method for selecting appropriate synthetic speech samples from a given large text-to-speech (TTS) dataset as supplementary training data for an automatic speech recognition (ASR) model. We trained a neural network, which can be optimised using cross-entropy loss or Arcface loss, to measure the similarity of synthetic data to real speech. We found that incorporating syntheti…

    Submitted 30 May, 2023; originally announced June 2023.

  12. arXiv:2305.12498  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Multi-Head State Space Model for Speech Recognition

    Authors: Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, Mark J. F. Gales

    Abstract: State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in…

    Submitted 25 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  13. arXiv:2212.07650  [pdf, other]

    eess.AS

    Improving Fast-slow Encoder based Transducer with Streaming Deliberation

    Authors: Ke Li, Jay Mahadeokar, Jinxi Guo, Yangyang Shi, Gil Keren, Ozlem Kalinli, Michael L. Seltzer, Duc Le

    Abstract: This paper introduces a fast-slow encoder based transducer with streaming deliberation for end-to-end automatic speech recognition. We aim to improve the recognition accuracy of the fast-slow encoder based transducer while keeping its latency low by integrating a streaming deliberation model. Specifically, the deliberation model leverages partial hypotheses from the streaming fast encoder and impl…

    Submitted 15 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  14. arXiv:2211.05756  [pdf, other]

    cs.CL cs.SD eess.AS

    Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities

    Authors: Andros Tjandra, Nayan Singhal, David Zhang, Ozlem Kalinli, Abdelrahman Mohamed, Duc Le, Michael L. Seltzer

    Abstract: End-to-end multilingual ASR has become more appealing because of several reasons such as simplifying the training and deployment process and positive performance transfer from high-resource to low-resource languages. However, scaling up the number of languages, total hours, and number of unique tokens is not a trivial task. This paper explores large-scale multilingual ASR models on 70 languages. W…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  15. arXiv:2211.00896  [pdf, other]

    eess.AS cs.SD

    Factorized Blank Thresholding for Improved Runtime Efficiency of Neural Transducers

    Authors: Duc Le, Frank Seide, Yuhao Wang, Yang Li, Kjell Schubert, Ozlem Kalinli, Michael L. Seltzer

    Abstract: We show how factoring the RNN-T's output distribution can significantly reduce the computation cost and power consumption for on-device ASR inference with no loss in accuracy. With the rise in popularity of neural-transducer type models like the RNN-T for on-device ASR, optimizing RNN-T's runtime efficiency is of great interest. While previous work has primarily focused on the optimization of RNN-…

    Submitted 4 March, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted for publication at ICASSP 2023

  16. arXiv:2211.00174  [pdf, other]

    cs.CL cs.SD eess.AS

    Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition

    Authors: Suyoun Kim, Ke Li, Lucas Kabela, Rongqing Huang, Jiedan Zhu, Ozlem Kalinli, Duc Le

    Abstract: Recently, there has been an increasing interest in two-pass streaming end-to-end speech recognition (ASR) that incorporates a 2nd-pass rescoring model on top of the conventional 1st-pass streaming ASR model to improve recognition accuracy while keeping latency low. One of the latest 2nd-pass rescoring models, Transformer Rescorer, takes the n-best initial outputs and audio embeddings from the 1st-p…

    Submitted 31 October, 2022; originally announced November 2022.

    Journal ref: Findings of EMNLP 2022 short

  17. arXiv:2210.11588  [pdf, other]

    eess.AS cs.SD

    Anchored Speech Recognition with Neural Transducers

    Authors: Desh Raj, Junteng Jia, Jay Mahadeokar, Chunyang Wu, Niko Moritz, Xiaohui Zhang, Ozlem Kalinli

    Abstract: Neural transducers have achieved human-level performance on standard speech recognition benchmarks. However, their performance significantly degrades in the presence of cross-talk, especially when the primary speaker has a low signal-to-noise ratio. Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., wake-words) to recognize device-directed s…

    Submitted 29 March, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

    Comments: To appear at IEEE ICASSP 2023

  18. arXiv:2209.05735  [pdf, other]

    eess.AS cs.CL

    Learning ASR pathways: A sparse multilingual ASR model

    Authors: Mu Yang, Andros Tjandra, Chunxi Liu, David Zhang, Duc Le, Ozlem Kalinli

    Abstract: Neural network pruning compresses automatic speech recognition (ASR) models effectively. However, in multilingual ASR, language-agnostic pruning may lead to severe performance drops on some languages because language-agnostic pruning masks may not fit all languages and discard important language-specific parameters. In this work, we present ASR pathways, a sparse multilingual ASR model that activa…

    Submitted 28 September, 2023; v1 submitted 13 September, 2022; originally announced September 2022.

    Comments: Accepted by ICASSP 2023

  19. arXiv:2207.11906  [pdf, other]

    eess.AS cs.CL cs.SD

    Learning a Dual-Mode Speech Recognition Model via Self-Pruning

    Authors: Chunxi Liu, Yuan Shangguan, Haichuan Yang, Yangyang Shi, Raghuraman Krishnamoorthi, Ozlem Kalinli

    Abstract: There is growing interest in unifying the streaming and full-context automatic speech recognition (ASR) networks into a single end-to-end ASR model to simplify the model training and deployment for both use cases. While in real-world ASR applications, the streaming ASR models typically operate under more storage and computational constraints - e.g., on embedded devices - than any server-side full-…

    Submitted 6 October, 2022; v1 submitted 25 July, 2022; originally announced July 2022.

    Comments: 7 pages, 1 figure. Accepted for publication at IEEE Spoken Language Technology Workshop (SLT), 2022

  20. arXiv:2204.01893  [pdf, other]

    cs.CL eess.AS

    Deliberation Model for On-Device Spoken Language Understanding

    Authors: Duc Le, Akshat Shrivastava, Paden Tomasello, Suyoun Kim, Aleksandr Livshits, Ozlem Kalinli, Michael L. Seltzer

    Abstract: We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU), where a streaming automatic speech recognition (ASR) model produces the first-pass hypothesis and a second-pass natural language understanding (NLU) component generates the semantic parse by conditioning on both ASR's text and audio embeddings. By formulating E2E SLU as a generalized decoder, ou…

    Submitted 6 September, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

    Comments: Accepted for publication at INTERSPEECH 2022

  21. arXiv:2203.15966  [pdf, other]

    cs.SD cs.CL eess.AS

    Federated Domain Adaptation for ASR with Full Self-Supervision

    Authors: Junteng Jia, Jay Mahadeokar, Weiyi Zheng, Yuan Shangguan, Ozlem Kalinli, Frank Seide

    Abstract: Cross-device federated learning (FL) protects user privacy by collaboratively training a model on user devices, therefore eliminating the need for collecting, storing, and manually labeling user data. While important topics such as the FL training algorithm, non-IID-ness, and Differential Privacy have been well studied in the literature, this paper focuses on two challenges of practical importance…

    Submitted 5 April, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

  22. arXiv:2203.15773  [pdf, other]

    cs.CL cs.SD eess.AS

    Streaming parallel transducer beam search with fast-slow cascaded encoders

    Authors: Jay Mahadeokar, Yangyang Shi, Ke Li, Duc Le, Jiedan Zhu, Vikas Chandra, Ozlem Kalinli, Michael L Seltzer

    Abstract: Streaming ASR with strict latency constraints is required in many speech recognition applications. In order to achieve the required latency, streaming ASR models sacrifice accuracy compared to non-streaming ASR models due to lack of future input context. Previous research has shown that streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders.…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: 5 pages, 2 figures, Interspeech 2022 submission

  23. arXiv:2201.11867  [pdf, other]

    cs.CL cs.SD eess.AS

    Neural-FST Class Language Model for End-to-End Speech Recognition

    Authors: Antoine Bruguier, Duc Le, Rohit Prabhavalkar, Dangna Li, Zhe Liu, Bo Wang, Eun Chang, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer

    Abstract: We propose Neural-FST Class Language Model (NFCLM) for end-to-end speech recognition, a novel method that combines neural network language models (NNLMs) and finite state transducers (FSTs) in a mathematically consistent framework. Our method utilizes a background NNLM which models generic background text together with a collection of domain-specific entities modeled as individual FSTs. Each outpu…

    Submitted 31 January, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

    Comments: Accepted for publication at ICASSP 2022

  24. arXiv:2111.05948  [pdf, other]

    cs.CL cs.SD eess.AS

    Scaling ASR Improves Zero and Few Shot Learning

    Authors: Alex Xiao, Weiyi Zheng, Gil Keren, Duc Le, Frank Zhang, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Abdelrahman Mohamed

    Abstract: With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition. We propose data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets. To efficiently scale model sizes, we leverage various optimizations such a…

    Submitted 29 November, 2021; v1 submitted 10 November, 2021; originally announced November 2021.

  25. arXiv:2110.08352  [pdf, other]

    cs.SD cs.CL eess.AS

    Omni-sparsity DNN: Fast Sparsity Optimization for On-Device Streaming E2E ASR via Supernet

    Authors: Haichuan Yang, Yuan Shangguan, Dilin Wang, Meng Li, Pierce Chuang, Xiaohui Zhang, Ganesh Venkatesh, Ozlem Kalinli, Vikas Chandra

    Abstract: From wearables to powerful smart devices, modern automatic speech recognition (ASR) models run on a variety of edge devices with different computational budgets. To navigate the Pareto front of model accuracy vs model size, researchers are trapped in a dilemma of optimizing model accuracy by training and fine-tuning models for each individual edge device while keeping the training GPU-hours tracta…

    Submitted 20 July, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

  26. arXiv:2110.05241  [pdf, other]

    eess.AS cs.CL cs.LG

    Streaming Transformer Transducer Based Speech Recognition Using Non-Causal Convolution

    Authors: Yangyang Shi, Chunyang Wu, Dilin Wang, Alex Xiao, Jay Mahadeokar, Xiaohui Zhang, Chunxi Liu, Ke Li, Yuan Shangguan, Varun Nagaraja, Ozlem Kalinli, Mike Seltzer

    Abstract: This paper improves the streaming transformer transducer for speech recognition by using non-causal convolution. Many works apply causal convolution to improve the streaming transformer, ignoring the lookahead context. We propose to use non-causal convolution to process the center block and lookahead context separately. This method leverages the lookahead context in convolution and maintains simila…

    Submitted 7 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, submitted to ICASSP 2022

  27. arXiv:2110.03174  [pdf, other]

    cs.SD cs.AI eess.AS

    Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study

    Authors: Dawei Liang, Yangyang Shi, Yun Wang, Nayan Singhal, Alex Xiao, Jonathan Shaw, Edison Thomaz, Ozlem Kalinli, Mike Seltzer

    Abstract: Detection of common events and scenes from audio is useful for extracting and understanding human contexts in daily life. Prior studies have shown that leveraging knowledge from a relevant domain is beneficial for a target acoustic event detection (AED) process. Inspired by the observation that many human-centered acoustic events in daily life involve voice elements, this paper investigates the po…

    Submitted 7 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  28. arXiv:2106.08960  [pdf, other]

    cs.CL cs.SD eess.AS

    Collaborative Training of Acoustic Encoders for Speech Recognition

    Authors: Varun Nagaraja, Yangyang Shi, Ganesh Venkatesh, Ozlem Kalinli, Michael L. Seltzer, Vikas Chandra

    Abstract: On-device speech recognition requires training models of different sizes for deploying on devices with various computational budgets. When building such different models, we can benefit from training them jointly to take advantage of the knowledge shared between them. Joint training is also efficient since it reduces the redundancy in the training procedure's data handling operations. We propose a…

    Submitted 13 July, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

    Comments: INTERSPEECH 2021

  29. arXiv:2104.02232  [pdf, other]

    cs.SD cs.CL eess.AS

    Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios

    Authors: Jay Mahadeokar, Yangyang Shi, Yuan Shangguan, Chunyang Wu, Alex Xiao, Hang Su, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

    Abstract: Often, the storage and computational constraints of embedded devices demand that a single on-device ASR model serve multiple use-cases / domains. In this paper, we propose a Flexible Transducer (FlexiT) for on-device automatic speech recognition to flexibly deal with multiple use-cases / domains with different accuracy and latency requirements. Specifically, using a single compact model, FlexiT provid…

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: Submitted to Interspeech 2021 (under review)

  30. arXiv:2104.02207  [pdf, other]

    cs.SD cs.CL eess.AS

    Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

    Authors: Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

    Abstract: As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that can run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task. Apart from being accurate…

    Submitted 11 August, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: Proc. of Interspeech 2021

  31. arXiv:2104.02194  [pdf, other]

    cs.CL cs.LG eess.AS

    Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion

    Authors: Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer

    Abstract: How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area. Previous solutions to this problem were either designed for specialized use cases that did not generalize well to open-domain scenarios, did not scale to large biasing lists, or underperformed on rare long-tail words. We address these limitations by proposing a novel solution that…

    Submitted 11 June, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: Accepted for presentation at INTERSPEECH 2021

  32. arXiv:1909.02667  [pdf, other]

    eess.AS cs.CL cs.LG

    Bandwidth Embeddings for Mixed-bandwidth Speech Recognition

    Authors: Gautam Mantena, Ozlem Kalinli, Ossama Abdel-Hamid, Don McAllaster

    Abstract: In this paper, we tackle the problem of handling narrowband and wideband speech by building a single acoustic model (AM), also called mixed bandwidth AM. In the proposed approach, an auxiliary input feature is used to provide the bandwidth information to the model, and bandwidth embeddings are jointly learned as part of acoustic model training. Experimental evaluations show that using bandwidth em…

    Submitted 5 September, 2019; originally announced September 2019.

    Comments: A part of this work is accepted in Interspeech 2019 https://interspeech2019.org