Showing 1–35 of 35 results for author: Fuegen, C

  1. arXiv:2404.01716  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG

    Effective internal language model training and fusion for factorized transducer model

    Authors: Jinxi Guo, Niko Moritz, Yingyi Ma, Frank Seide, Chunyang Wu, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

    Abstract: The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is mainly used for estimating the ILM score and is subsequently subtracted during inference to facilitate improved integration with external language models. Recently, various factorized transducer models have been proposed, which explicitly embrace a standalone internal language model for…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: Accepted to ICASSP 2024

  2. arXiv:2401.10411  [pdf, other]

    eess.AS cs.SD

    AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition

    Authors: Ju Lin, Niko Moritz, Yiteng Huang, Ruiming Xie, Ming Sun, Christian Fuegen, Frank Seide

    Abstract: Wearable devices like smart glasses are approaching the compute capability to seamlessly generate real-time closed captions for live conversations. We build on our recently introduced directional Automatic Speech Recognition (ASR) for smart glasses that have microphone arrays, which fuses multi-channel ASR with serialized output training, for wearer/conversation-partner disambiguation as well as s…

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  3. arXiv:2311.06753  [pdf, other]

    cs.CL cs.AI

    AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs

    Authors: Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Ke Li, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

    Abstract: In this work, we extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities while maintaining the wide range of original LLM capabilities, without using any carefully curated paired data. The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation. Such a model also ha…

    Submitted 12 April, 2024; v1 submitted 12 November, 2023; originally announced November 2023.

  4. arXiv:2309.10917  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    End-to-End Speech Recognition Contextualization with Large Language Models

    Authors: Egor Lakomkin, Chunyang Wu, Yassir Fathullah, Ozlem Kalinli, Michael L. Seltzer, Christian Fuegen

    Abstract: In recent years, Large Language Models (LLMs) have garnered significant attention from the research community due to their exceptional performance and generalization capabilities. In this paper, we introduce a novel method for contextualizing speech recognition models incorporating LLMs. Our approach casts speech recognition as a mixed-modal language modeling task based on a pretrained LLM. We pro…

    Submitted 19 September, 2023; originally announced September 2023.

  5. arXiv:2307.11795  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG

    Prompting Large Language Models with Speech Recognition Abilities

    Authors: Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

    Abstract: Large language models have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder allowing it to perform speech recognition. By directly prepending a sequence of audial embeddings to the text token embeddings,…

    Submitted 21 July, 2023; originally announced July 2023.

  6. arXiv:2303.17200  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

    Authors: Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen

    Abstract: Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems wit…

    Submitted 3 April, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: IEEE/CVF CVPR 2023

  7. arXiv:2211.02133  [pdf, other]

    eess.AS cs.CV cs.SD

    Streaming Audio-Visual Speech Recognition with Alignment Regularization

    Authors: Pingchuan Ma, Niko Moritz, Stavros Petridis, Christian Fuegen, Maja Pantic

    Abstract: In this work, we propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture. The audio and the visual encoder neural networks are both based on the conformer architecture, which is made streamable using chunk-wise self-attention (CSA) and causal convolution. Streaming recognition with a decoder neural network is realized by…

    Submitted 1 July, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

    Comments: Accepted to Interspeech 2023

  8. arXiv:2204.08858  [pdf, other]

    eess.AS cs.SD

    An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

    Authors: Niko Moritz, Frank Seide, Duc Le, Jay Mahadeokar, Christian Fuegen

    Abstract: The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are RNN-Transducer (RNN-T) and connectionist temporal classification (CTC). Between these two loss types we can classify the monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T). Monotonic transducers have a few advantages. First, RNN-T can suffer from runaway hallucination,…

    Submitted 21 October, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

    Comments: Accepted to SLT 2022

  9. arXiv:2111.05948  [pdf, other]

    cs.CL cs.SD eess.AS

    Scaling ASR Improves Zero and Few Shot Learning

    Authors: Alex Xiao, Weiyi Zheng, Gil Keren, Duc Le, Frank Zhang, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Abdelrahman Mohamed

    Abstract: With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition. We propose data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets. To efficiently scale model sizes, we leverage various optimizations such a…

    Submitted 29 November, 2021; v1 submitted 10 November, 2021; originally announced November 2021.

  10. arXiv:2110.07058  [pdf, other]

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons…

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  11. arXiv:2110.05376  [pdf, other]

    cs.CL

    Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric

    Authors: Suyoun Kim, Duc Le, Weiyi Zheng, Tarun Singh, Abhinav Arora, Xiaoyu Zhai, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer

    Abstract: Measuring automatic speech recognition (ASR) system quality is critical for creating user-satisfying voice-driven applications. Word Error Rate (WER) has been traditionally used to evaluate ASR system quality; however, it sometimes correlates poorly with user perception/judgement of transcription quality. This is because WER weighs every word equally and does not consider semantic correctness whic…

    Submitted 5 July, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

    Comments: INTERSPEECH 2022

  12. arXiv:2106.11335  [pdf, other]

    cs.SD cs.AI eess.AS

    Do sound event representations generalize to other audio tasks? A case study in audio transfer learning

    Authors: Anurag Kumar, Yun Wang, Vamsi Krishna Ithapu, Christian Fuegen

    Abstract: Transfer learning is critical for efficient information transfer across multiple related learning problems. A simple, yet effective transfer learning approach utilizes deep neural networks trained on a large-scale task for feature extraction. Such representations are then used to learn related downstream tasks. In this paper, we investigate transfer learning capacity of audio representations obtai…

    Submitted 21 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021

  13. arXiv:2104.02232  [pdf, other]

    cs.SD cs.CL eess.AS

    Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios

    Authors: Jay Mahadeokar, Yangyang Shi, Yuan Shangguan, Chunyang Wu, Alex Xiao, Hang Su, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

    Abstract: Often, the storage and computational constraints of embedded devices demand that a single on-device ASR model serve multiple use-cases / domains. In this paper, we propose a Flexible Transducer (FlexiT) for on-device automatic speech recognition to flexibly deal with multiple use-cases / domains with different accuracy and latency requirements. Specifically, using a single compact model, FlexiT provid…

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: Submitted to Interspeech 2021 (under review)

  14. arXiv:2104.02207  [pdf, other]

    cs.SD cs.CL eess.AS

    Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

    Authors: Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

    Abstract: As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that can run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task. Apart from being accurate…

    Submitted 11 August, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: Proc. of Interspeech 2021

  15. arXiv:2104.02194  [pdf, other]

    cs.CL cs.LG eess.AS

    Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion

    Authors: Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer

    Abstract: How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area. Previous solutions to this problem were either designed for specialized use cases that did not generalize well to open-domain scenarios, did not scale to large biasing lists, or underperformed on rare long-tail words. We address these limitations by proposing a novel solution that…

    Submitted 11 June, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: Accepted for presentation at INTERSPEECH 2021

  16. arXiv:2104.02176  [pdf, other]

    cs.CL

    Dynamic Encoder Transducer: A Flexible Solution For Trading Off Accuracy For Latency

    Authors: Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer

    Abstract: We propose a dynamic encoder transducer (DET) for on-device speech recognition. One DET model scales to multiple devices with different computation capacities without retraining or finetuning. To trade off accuracy and latency, DET assigns different encoders to decode different parts of an utterance. We apply and compare the layer dropout and the collaborative learning for DET training. The laye…

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: 5 pages, 2 figures, submitted to Interspeech 2021

  17. arXiv:2104.02138  [pdf, other]

    cs.CL

    Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding

    Authors: Suyoun Kim, Abhinav Arora, Duc Le, Ching-Feng Yeh, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer

    Abstract: Word Error Rate (WER) has been the predominant metric used to evaluate the performance of automatic speech recognition (ASR) systems. However, WER is sometimes not a good indicator for downstream Natural Language Understanding (NLU) tasks, such as intent recognition, slot filling, and semantic parsing in task-oriented dialog systems. This is because WER takes into consideration only literal correc…

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: submitted to Interspeech 2021

  18. arXiv:2103.05149  [pdf, ps, other]

    cs.CL

    Contrastive Semi-supervised Learning for ASR

    Authors: Alex Xiao, Christian Fuegen, Abdelrahman Mohamed

    Abstract: Pseudo-labeling is the most adopted method for pre-training automatic speech recognition (ASR) models. However, its performance suffers from the supervised teacher model's degrading quality in low-resource setups and under domain transfer. Inspired by the successes of contrastive representation learning for computer vision and speech applications, and more recently for supervised learning of visua…

    Submitted 8 March, 2021; originally announced March 2021.

  19. arXiv:2102.11531  [pdf, other]

    cs.SD cs.CL eess.AS

    Memory-efficient Speech Recognition on Smart Devices

    Authors: Ganesh Venkatesh, Alagappan Valliappan, Jay Mahadeokar, Yuan Shangguan, Christian Fuegen, Michael L. Seltzer, Vikas Chandra

    Abstract: Recurrent transducer models have emerged as a promising solution for speech recognition on the current and next generation smart devices. The transducer models provide competitive accuracy within a reasonable memory footprint alleviating the memory capacity constraints in these devices. However, these models access parameters from off-chip memory for every input time step which adversely affects d…

    Submitted 23 February, 2021; originally announced February 2021.

    Journal ref: ICASSP 2021

  20. arXiv:2011.07754  [pdf, other]

    cs.CL eess.AS

    Deep Shallow Fusion for RNN-T Personalization

    Authors: Duc Le, Gil Keren, Julian Chan, Jay Mahadeokar, Christian Fuegen, Michael L. Seltzer

    Abstract: End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T) in particular, have gained significant traction in the automatic speech recognition community in the last few years due to their simplicity, compactness, and excellent performance on generic transcription tasks. However, these models are more challenging to personalize compared to traditional hybrid systems due to the la…

    Submitted 16 November, 2020; originally announced November 2020.

    Comments: To appear at SLT 2021

  21. arXiv:2011.03072  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Alignment Restricted Streaming Recurrent Neural Network Transducer

    Authors: Jay Mahadeokar, Yuan Shangguan, Duc Le, Gil Keren, Hang Su, Thong Le, Ching-Feng Yeh, Christian Fuegen, Michael L. Seltzer

    Abstract: There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal alignment of the training transcripts and audio. As a result, RNN-T models built with uni-directional long short term memory (LSTM) encoders tend to wait for lon…

    Submitted 5 November, 2020; originally announced November 2020.

    Comments: Accepted for presentation at IEEE Spoken Language Technology Workshop (SLT) 2021

  22. arXiv:2010.13878  [pdf, ps, other]

    cs.CL

    Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer

    Authors: Suyoun Kim, Yuan Shangguan, Jay Mahadeokar, Antoine Bruguier, Christian Fuegen, Michael L. Seltzer, Duc Le

    Abstract: Recurrent Neural Network Transducer (RNN-T), like most end-to-end speech recognition model architectures, has an implicit neural network language model (NNLM) and cannot easily leverage unpaired text data during training. Previous work has proposed various fusion methods to incorporate external NNLMs into end-to-end ASR to address this weakness. In this paper, we propose extensions to these techni…

    Submitted 26 October, 2020; originally announced October 2020.

    Comments: submitted to ICASSP 2021

  23. arXiv:2005.09137  [pdf, other]

    eess.AS cs.CL

    Weak-Attention Suppression For Transformer Based Speech Recognition

    Authors: Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank Zhang, Duc Le, Ching-Feng Yeh, Michael L. Seltzer

    Abstract: Transformers, originally proposed for natural language processing (NLP) tasks, have recently achieved great success in automatic speech recognition (ASR). However, adjacent acoustic units (i.e., frames) are highly correlated, and long-distance dependencies between them are weak, unlike text units. It suggests that ASR will likely benefit from sparse and localized attention. In this paper, we propo…

    Submitted 18 May, 2020; originally announced May 2020.

    Comments: submitted to Interspeech 2020

  24. arXiv:2005.07850  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Large scale weakly and semi-supervised learning for low-resource video ASR

    Authors: Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on th…

    Submitted 6 August, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

  25. arXiv:2002.06758  [pdf, other]

    cs.SD eess.AS

    Interactive Text-to-Speech System via Joint Style Analysis

    Authors: Yang Gao, Weiyi Zheng, Zhaojun Yang, Thilo Kohler, Christian Fuegen, Qing He

    Abstract: While modern TTS technologies have made significant advancements in audio quality, there is still a lack of behavior naturalness compared to conversing with people. We propose a style-embedded TTS system that generates styled responses based on the speech query style. To achieve this, the system includes a style extraction model that extracts a style embedding from the speech query, which is then…

    Submitted 21 September, 2020; v1 submitted 16 February, 2020; originally announced February 2020.

    Comments: Accepted by Interspeech 2020

  26. Libri-Light: A Benchmark for ASR with Limited or No Supervision

    Authors: Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR…

    Submitted 17 December, 2019; originally announced December 2019.

  27. arXiv:1911.02115  [pdf, ps, other]

    eess.AS cs.SD

    Spatial Attention for Far-field Speech Recognition with Deep Beamforming Neural Networks

    Authors: Weipeng He, Lu Lu, Biqiao Zhang, Jay Mahadeokar, Kaustubh Kalgaonkar, Christian Fuegen

    Abstract: In this paper, we introduce spatial attention for refining the information in multi-direction neural beamformer for far-field automatic speech recognition. Previous approaches of neural beamformers with multiple look directions, such as the factored complex linear projection, have shown promising results. However, the features extracted by such methods contain redundant information, as only the di…

    Submitted 9 March, 2020; v1 submitted 5 November, 2019; originally announced November 2019.

    Comments: To be presented at ICASSP 2020

  28. arXiv:1911.01629  [pdf, other]

    cs.CL cs.LG eess.AS

    RNN-T For Latency Controlled ASR With Improved Beam Search

    Authors: Mahaveer Jain, Kjell Schubert, Jay Mahadeokar, Ching-Feng Yeh, Kaustubh Kalgaonkar, Anuroop Sriram, Christian Fuegen, Michael L. Seltzer

    Abstract: Neural transducer-based systems such as RNN Transducers (RNN-T) for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR system (acoustic model, language model, punctuation model, inverse text normalization) into one single model. This greatly simplifies training and inference and hence makes RNN-T a desirable choice for ASR systems. In this work, we inve…

    Submitted 16 January, 2020; v1 submitted 5 November, 2019; originally announced November 2019.

  29. arXiv:1910.12977  [pdf, other]

    eess.AS cs.CL cs.SD

    Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

    Authors: Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer

    Abstract: We explore options to use Transformer networks in neural transducer for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and come with advantages in parallel computation and capturing contexts. We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce frame rate for efficient inference 2) using truncated self-at…

    Submitted 28 October, 2019; originally announced October 2019.

  30. arXiv:1910.12612  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    G2G: TTS-Driven Pronunciation Learning for Graphemic Hybrid ASR

    Authors: Duc Le, Thilo Koehler, Christian Fuegen, Michael L. Seltzer

    Abstract: Grapheme-based acoustic modeling has recently been shown to outperform phoneme-based approaches in both hybrid and end-to-end automatic speech recognition (ASR), even on non-phonemic languages like English. However, graphemic ASR still has problems with rare long-tail words that do not follow the standard spelling conventions seen in training, such as entity names. In this work, we present a novel…

    Submitted 13 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: To appear at ICASSP 2020

  31. Transformer-based Acoustic Modeling for Hybrid Speech Recognition

    Authors: Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, Christian Fuegen, Geoffrey Zweig, Michael L. Seltzer

    Abstract: We propose and evaluate transformer-based acoustic models (AMs) for hybrid speech recognition. Several modeling choices are discussed in this work, including various positional embedding methods and an iterated loss to enable training deep transformers. We also present a preliminary study of using limited right context in transformer models, which makes it possible for streaming applications. We d…

    Submitted 29 April, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: to appear in ICASSP 2020

  32. arXiv:1910.01493  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    From Senones to Chenones: Tied Context-Dependent Graphemes for Hybrid Speech Recognition

    Authors: Duc Le, Xiaohui Zhang, Weiyi Zheng, Christian Fügen, Geoffrey Zweig, Michael L. Seltzer

    Abstract: There is an implicit assumption that traditional hybrid approaches for automatic speech recognition (ASR) cannot directly model graphemes and need to rely on phonetic lexicons to get competitive performance, especially on English which has poor grapheme-phoneme correspondence. In this work, we show for the first time that, on English, hybrid ASR systems can in fact model graphemes effectively by l…

    Submitted 11 October, 2019; v1 submitted 2 October, 2019; originally announced October 2019.

    Comments: To appear at ASRU 2019

  33. arXiv:1812.02142  [pdf, other]

    cs.CL cs.LG

    End-to-end contextual speech recognition using class language models and a token passing decoder

    Authors: Zhehuai Chen, Mahaveer Jain, Yongqiang Wang, Michael L. Seltzer, Christian Fuegen

    Abstract: End-to-end modeling (E2E) of automatic speech recognition (ASR) blends all the components of a traditional speech recognition system into a unified model. Although it simplifies training and decoding pipelines, the unified model is hard to adapt when mismatch exists between training and test data. In this work, we focus on contextual speech recognition, which is particularly challenging for E2E mo…

    Submitted 5 December, 2018; originally announced December 2018.

    Comments: Submitted to ICASSP 2019

    MSC Class: 68T10; ACM Class: I.2.7

  34. arXiv:1802.08395  [pdf, other]

    cs.CL

    Towards end-to-end spoken language understanding

    Authors: Dmitriy Serdyuk, Yongqiang Wang, Christian Fuegen, Anuj Kumar, Baiyang Liu, Yoshua Bengio

    Abstract: A spoken language understanding system is traditionally designed as a pipeline of a number of components. First, the audio signal is processed by an automatic speech recognizer for transcription or n-best hypotheses. With the recognition results, a natural language understanding system classifies the text to structured data as domain, intent and slots for downstream consumers, such as dialog sys…

    Submitted 23 February, 2018; originally announced February 2018.

    Comments: submitted to ICASSP 2018

  35. arXiv:1711.01369  [pdf, other]

    cs.SD cs.MM eess.AS

    Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes

    Authors: Anurag Kumar, Maksim Khadkevich, Christian Fugen

    Abstract: In this work we propose approaches to effectively transfer knowledge from weakly labeled web audio data. We first describe a convolutional neural network (CNN) based framework for sound event detection and classification using weakly labeled audio data. Our model trains efficiently from audios of variable lengths; hence, it is well suited for transfer learning. We then propose methods to learn rep…

    Submitted 7 September, 2018; v1 submitted 3 November, 2017; originally announced November 2017.

    Comments: ICASSP 2018