Skip to main content

Showing 1–13 of 13 results for author: Seide, F

Searching in archive cs. Search in all archives.
.
  1. arXiv:2406.09569  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time

    Authors: Frank Seide, Morrie Doulaty, Yangyang Shi, Yashesh Gaur, Junteng Jia, Chunyang Wu

    Abstract: We introduce Speech ReaLLM, a new ASR architecture that marries "decoder-only" ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the first "decoder-only" ASR architecture designed to handle continuous audio without explicit end-pointing. Speech ReaLLM is a special case of the more general ReaLLM ("real-time LLM") approach, also introduced here for the… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

  2. arXiv:2404.01716  [pdf, other

    eess.AS cs.AI cs.CL cs.LG

    Effective internal language model training and fusion for factorized transducer model

    Authors: **xi Guo, Niko Moritz, Yingyi Ma, Frank Seide, Chunyang Wu, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

    Abstract: The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is mainly used for estimating the ILM score and is subsequently subtracted during inference to facilitate improved integration with external language models. Recently, various of factorized transducer models have been proposed, which explicitly embrace a standalone internal language model for… ▽ More

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: Accepted to ICASSP 2024

  3. arXiv:2401.10411  [pdf, other

    eess.AS cs.SD

    AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition

    Authors: Ju Lin, Niko Moritz, Yiteng Huang, Ruiming Xie, Ming Sun, Christian Fuegen, Frank Seide

    Abstract: Wearable devices like smart glasses are approaching the compute capability to seamlessly generate real-time closed captions for live conversations. We build on our recently introduced directional Automatic Speech Recognition (ASR) for smart glasses that have microphone arrays, which fuses multi-channel ASR with serialized output training, for wearer/conversation-partner disambiguation as well as s… ▽ More

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  4. arXiv:2309.10993  [pdf, other

    cs.SD cs.HC eess.AS

    Directional Source Separation for Robust Speech Recognition on Smart Glasses

    Authors: Tiantian Feng, Ju Lin, Yiteng Huang, Weipeng He, Kaustubh Kalgaonkar, Niko Moritz, Li Wan, Xin Lei, Ming Sun, Frank Seide

    Abstract: Modern smart glasses leverage advanced audio sensing and machine learning technologies to offer real-time transcribing and captioning services, considerably enriching human experiences in daily communications. However, such systems frequently encounter challenges related to environmental noises, resulting in degradation to speech recognition and speaker change detection. To improve voice quality,… ▽ More

    Submitted 19 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  5. arXiv:2308.13173  [pdf, other

    cs.CV cs.CL

    DISGO: Automatic End-to-End Evaluation for Scene Text OCR

    Authors: Mei-Yuh Hwang, Yangyang Shi, Ankit Ramchandani, Guan Pang, Praveen Krishnan, Lucas Kabela, Frank Seide, Samyak Datta, Jun Liu

    Abstract: This paper discusses the challenges of optical character recognition (OCR) on natural scenes, which is harder than OCR on documents due to the wild content and various image backgrounds. We propose to uniformly use word error rates (WER) as a new measurement for evaluating scene-text OCR, both end-to-end (e2e) performance and individual system component performances. Particularly for the e2e metri… ▽ More

    Submitted 25 August, 2023; originally announced August 2023.

    Comments: 9 pages

  6. arXiv:2211.00896  [pdf, other

    eess.AS cs.SD

    Factorized Blank Thresholding for Improved Runtime Efficiency of Neural Transducers

    Authors: Duc Le, Frank Seide, Yuhao Wang, Yang Li, Kjell Schubert, Ozlem Kalinli, Michael L. Seltzer

    Abstract: We show how factoring the RNN-T's output distribution can significantly reduce the computation cost and power consumption for on-device ASR inference with no loss in accuracy. With the rise in popularity of neural-transducer type models like the RNN-T for on-device ASR, optimizing RNN-T's runtime efficiency is of great interest. While previous work has primarily focused on the optimization of RNN-… ▽ More

    Submitted 4 March, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted for publication at ICASSP 2023

  7. arXiv:2204.08858  [pdf, other

    eess.AS cs.SD

    An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

    Authors: Niko Moritz, Frank Seide, Duc Le, Jay Mahadeokar, Christian Fuegen

    Abstract: The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are RNN-Transducer (RNN-T) and connectionist temporal classification (CTC). Between these two loss types we can classify the monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T). Monotonic transducers have a few advantages. First, RNN-T can suffer from runaway hallucination,… ▽ More

    Submitted 21 October, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

    Comments: Accepted to SLT 2022

  8. arXiv:2203.15966  [pdf, other

    cs.SD cs.CL eess.AS

    Federated Domain Adaptation for ASR with Full Self-Supervision

    Authors: Junteng Jia, Jay Mahadeokar, Weiyi Zheng, Yuan Shangguan, Ozlem Kalinli, Frank Seide

    Abstract: Cross-device federated learning (FL) protects user privacy by collaboratively training a model on user devices, therefore eliminating the need for collecting, storing, and manually labeling user data. While important topics such as the FL training algorithm, non-IID-ness, and Differential Privacy have been well studied in the literature, this paper focuses on two challenges of practical importance… ▽ More

    Submitted 5 April, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

  9. arXiv:1804.00344  [pdf, other

    cs.CL

    Marian: Fast Neural Machine Translation in C++

    Authors: Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, Alexandra Birch

    Abstract: We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

    Submitted 4 April, 2018; v1 submitted 1 April, 2018; originally announced April 2018.

    Comments: Demonstration paper

  10. arXiv:1803.05567  [pdf, other

    cs.CL

    Achieving Human Parity on Automatic Chinese to English News Translation

    Authors: Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, Ming Zhou

    Abstract: Machine translation has made rapid advances in recent years. Millions of people are using it today in online translation systems and mobile applications in order to communicate across language barriers. The question naturally arises whether such systems can approach or achieve parity with human translations. In this paper, we first address the problem of how to define and accurately measure human… ▽ More

    Submitted 29 June, 2018; v1 submitted 14 March, 2018; originally announced March 2018.

  11. arXiv:1610.05256  [pdf, other

    cs.CL eess.AS

    Achieving Human Parity in Conversational Speech Recognition

    Authors: W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig

    Abstract: Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached human parity. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which new… ▽ More

    Submitted 17 February, 2017; v1 submitted 17 October, 2016; originally announced October 2016.

    Comments: Revised for publication, updated results

    Report number: MSR-TR-2016-71, revised Feb. 2017

  12. The Microsoft 2016 Conversational Speech Recognition System

    Authors: W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig

    Abstract: We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training… ▽ More

    Submitted 25 January, 2017; v1 submitted 12 September, 2016; originally announced September 2016.

    Journal ref: Proc. IEEE ICASSP, March 2017, pp. 5255-5259

  13. arXiv:1301.3605  [pdf, other

    cs.LG cs.CL cs.NE eess.AS

    Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks

    Authors: Dong Yu, Michael L. Seltzer, **yu Li, Jui-Ting Huang, Frank Seide

    Abstract: Recent studies have shown that deep neural networks (DNNs) perform significantly better than shallow networks and Gaussian mixture models (GMMs) on large vocabulary speech recognition tasks. In this paper, we argue that the improved accuracy achieved by the DNNs is the result of their ability to extract discriminative internal representations that are robust to the many sources of variability in s… ▽ More

    Submitted 8 March, 2013; v1 submitted 16 January, 2013; originally announced January 2013.

    Comments: ICLR 2013, 9 pages, 4 figures