Showing 1–23 of 23 results for author: Zweig, G

  1. arXiv:2106.07759  [pdf, ps, other]

    eess.AS cs.CL

    Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

    Authors: Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised speech recognition (ASR). The proposed approach uses a teacher model which is updated as the exponential moving average (EMA) of the student model parameters. We demonstrate that it is critical for EMA to be accumulated with full-precision floating point. The Ka…

    Submitted 27 October, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

    Comments: Updated with camera ready version

  2. arXiv:2011.04785  [pdf, ps, other]

    eess.AS cs.SD

    Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR

    Authors: Xiaohui Zhang, Frank Zhang, Chunxi Liu, Kjell Schubert, Julian Chan, Pradyot Prakash, Jun Liu, Ching-Feng Yeh, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig

    Abstract: In this work, to measure the accuracy and efficiency for a latency-controlled streaming automatic speech recognition (ASR) application, we perform comprehensive evaluations on three popular training criteria: LF-MMI, CTC and RNN-T. In transcribing social media videos of 7 languages with training data 3K-14K hours, we conduct large-scale controlled experimentation across each criterion using identi…

    Submitted 9 November, 2020; originally announced November 2020.

    Comments: Accepted for publication at IEEE Spoken Language Technology Workshop (SLT), 2021

  3. arXiv:2011.03109  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving RNN Transducer Based ASR with Auxiliary Tasks

    Authors: Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, Geoffrey Zweig

    Abstract: End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results compared to conventional hybrid speech recognizers. Specifically, recurrent neural network transducer (RNN-T) has shown competitive ASR performance on various benchmarks. In this work, we examine ways in which RNN-T can achieve better ASR accuracy via performing aux…

    Submitted 8 November, 2020; v1 submitted 5 November, 2020; originally announced November 2020.

    Comments: Accepted for publication at IEEE Spoken Language Technology Workshop (SLT), 2021

  4. arXiv:2006.03411  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Contextual RNN-T For Open Domain ASR

    Authors: Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf

    Abstract: End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS) blend the individual components of a traditional hybrid ASR system - acoustic model, language model, pronunciation model - into a single neural network. While this has some nice advantages, it limits the system to be trained using only paired audio and text. Because of this…

    Submitted 12 August, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

  5. arXiv:2005.09150  [pdf, other]

    eess.AS cs.CL

    Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces

    Authors: Frank Zhang, Yongqiang Wang, Xiaohui Zhang, Chunxi Liu, Yatharth Saraf, Geoffrey Zweig

    Abstract: In this work, we first show that on the widely used LibriSpeech benchmark, our transformer-based context-dependent connectionist temporal classification (CTC) system produces state-of-the-art results. We then show that using wordpieces as modeling units combined with CTC training, we can greatly simplify the engineering pipeline compared to conventional frame-based cross-entropy training by exclud…

    Submitted 16 August, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: In proceedings Interspeech 2020

  6. arXiv:2005.07850  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Large scale weakly and semi-supervised learning for low-resource video ASR

    Authors: Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on th…

    Submitted 6 August, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

  7. arXiv:2005.07394  [pdf, other]

    cs.CL cs.SD eess.AS

    Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model

    Authors: Da-Rong Liu, Chunxi Liu, Frank Zhang, Gabriel Synnaeve, Yatharth Saraf, Geoffrey Zweig

    Abstract: Videos uploaded on social media are often accompanied with textual descriptions. In building automatic speech recognition (ASR) systems for videos, we can exploit the contextual information provided by such video metadata. In this paper, we explore ASR lattice rescoring by selectively attending to the video descriptions. We first use an attention based method to extract contextual vector represent…

    Submitted 15 May, 2020; originally announced May 2020.

  8. arXiv:2003.04298  [pdf, other]

    cs.CV

    On Compositions of Transformations in Contrastive Self-Supervised Learning

    Authors: Mandela Patrick, Yuki M. Asano, Polina Kuznetsova, Ruth Fong, João F. Henriques, Geoffrey Zweig, Andrea Vedaldi

    Abstract: In the image domain, excellent representations can be learned by inducing invariance to content-preserving transformations via noise contrastive learning. In this paper, we generalize contrastive learning to a wider set of transformations, and their compositions, for which either invariance or distinctiveness is sought. We show that it is not immediately obvious how existing methods such as SimCLR…

    Submitted 27 October, 2021; v1 submitted 9 March, 2020; originally announced March 2020.

    Comments: Accepted to ICCV 2021. Code and pretrained models are available at https://github.com/facebookresearch/GDT

  9. arXiv:1910.12367  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Training ASR models by Generation of Contextual Information

    Authors: Kritika Singh, Dmytro Okhonko, Jun Liu, Yongqiang Wang, Frank Zhang, Ross Girshick, Sergey Edunov, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has led to a surge in semi- and weakly-supervised learning research. In this paper, we conduct a large-scale study evaluating the effectiveness of weakly-supervised lea…

    Submitted 14 February, 2020; v1 submitted 27 October, 2019; originally announced October 2019.

  10. arXiv:1910.10324  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Deja-vu: Double Feature Presentation and Iterated Loss in Deep Transformer Networks

    Authors: Andros Tjandra, Chunxi Liu, Frank Zhang, Xiaohui Zhang, Yongqiang Wang, Gabriel Synnaeve, Satoshi Nakamura, Geoffrey Zweig

    Abstract: Deep acoustic models typically receive features in the first layer of the network, and process increasingly abstract representations in the subsequent layers. Here, we propose to feed the input features at multiple depths in the acoustic model. As our motivation is to allow acoustic models to re-examine their input features in light of partial hypotheses we introduce intermediate model heads and l…

    Submitted 13 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: Accepted in IEEE ICASSP 2020

  11. Transformer-based Acoustic Modeling for Hybrid Speech Recognition

    Authors: Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, Christian Fuegen, Geoffrey Zweig, Michael L. Seltzer

    Abstract: We propose and evaluate transformer-based acoustic models (AMs) for hybrid speech recognition. Several modeling choices are discussed in this work, including various positional embedding methods and an iterated loss to enable training deep transformers. We also present a preliminary study of using limited right context in transformer models, which makes it possible for streaming applications. We d…

    Submitted 29 April, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: to appear in ICASSP 2020

  12. arXiv:1910.01493  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    From Senones to Chenones: Tied Context-Dependent Graphemes for Hybrid Speech Recognition

    Authors: Duc Le, Xiaohui Zhang, Weiyi Zheng, Christian Fügen, Geoffrey Zweig, Michael L. Seltzer

    Abstract: There is an implicit assumption that traditional hybrid approaches for automatic speech recognition (ASR) cannot directly model graphemes and need to rely on phonetic lexicons to get competitive performance, especially on English which has poor grapheme-phoneme correspondence. In this work, we show for the first time that, on English, hybrid ASR systems can in fact model graphemes effectively by l…

    Submitted 11 October, 2019; v1 submitted 2 October, 2019; originally announced October 2019.

    Comments: To appear at ASRU 2019

  13. arXiv:1909.06522  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Multilingual Graphemic Hybrid ASR with Massive Data Augmentation

    Authors: Chunxi Liu, Qiaochu Zhang, Xiaohui Zhang, Kritika Singh, Yatharth Saraf, Geoffrey Zweig

    Abstract: Towards developing high-performing ASR for low-resource languages, approaches to address the lack of resources are to make use of data from multiple languages, and to augment the training data by creating acoustic variations. In this work we present a single grapheme-based ASR model learned on 7 geographically proximal languages, using standard hybrid BLSTM-HMM acoustic models with lattice-free MM…

    Submitted 8 April, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: Accepted for publication at the 1st Joint Workshop of SLTU (Spoken Language Technologies for Under-resourced languages) and CCURL (Collaboration and Computing for Under-Resourced Languages) (SLTU-CCURL 2020)

  14. arXiv:1702.03274  [pdf, other]

    cs.AI cs.CL

    Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning

    Authors: Jason D. Williams, Kavosh Asadi, Geoffrey Zweig

    Abstract: End-to-end learning of recurrent neural networks (RNNs) is an attractive solution for dialog systems; however, current techniques are data-intensive and require thousands of dialogs to learn simple behaviors. We introduce Hybrid Code Networks (HCNs), which combine an RNN with domain-specific knowledge encoded as software and system action templates. Compared to existing end-to-end approaches, HCNs…

    Submitted 24 April, 2017; v1 submitted 10 February, 2017; originally announced February 2017.

    Comments: Accepted as a long paper for the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017)

  15. arXiv:1610.05256  [pdf, other]

    cs.CL eess.AS

    Achieving Human Parity in Conversational Speech Recognition

    Authors: W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig

    Abstract: Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached human parity. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which new…

    Submitted 17 February, 2017; v1 submitted 17 October, 2016; originally announced October 2016.

    Comments: Revised for publication, updated results

    Report number: MSR-TR-2016-71, revised Feb. 2017

  16. Advances in All-Neural Speech Recognition

    Authors: G. Zweig, C. Yu, J. Droppo, A. Stolcke

    Abstract: This paper advances the design of CTC-based all-neural (or end-to-end) speech recognizers. We propose a novel symbol inventory, and a novel iterated-CTC method in which a second system is used to transform a noisy initial output into a cleaner version. We present a number of stabilization and initialization methods we have found useful in training these networks. We evaluate our system on the comm…

    Submitted 25 January, 2017; v1 submitted 19 September, 2016; originally announced September 2016.

    Journal ref: Proc. IEEE ICASSP, March 2017, pp. 4805-4809

  17. The Microsoft 2016 Conversational Speech Recognition System

    Authors: W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig

    Abstract: We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training…

    Submitted 25 January, 2017; v1 submitted 12 September, 2016; originally announced September 2016.

    Journal ref: Proc. IEEE ICASSP, March 2017, pp. 5255-5259

  18. arXiv:1606.01292  [pdf, other]

    cs.CL cs.HC

    An Attentional Neural Conversation Model with Improved Specificity

    Authors: Kaisheng Yao, Baolin Peng, Geoffrey Zweig, Kam-Fai Wong

    Abstract: In this paper we propose a neural conversation model for conducting dialogues. We demonstrate the use of this model to generate help desk responses, where users are asking questions about PC applications. Our model is distinguished by two characteristics. First, it models intention across turns with a recurrent network, and incorporates an attention model that is conditioned on the representation…

    Submitted 3 June, 2016; originally announced June 2016.

  19. arXiv:1606.01269  [pdf, other]

    cs.CL cs.AI cs.LG

    End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning

    Authors: Jason D. Williams, Geoffrey Zweig

    Abstract: This paper presents a model for end-to-end learning of task-oriented dialog systems. The main component of the model is a recurrent neural network (an LSTM), which maps from raw dialog history directly to a distribution over system actions. The LSTM automatically infers a representation of dialog history, which relieves the system developer of much of the manual feature engineering of dialog state…

    Submitted 3 June, 2016; originally announced June 2016.

  20. arXiv:1510.08565  [pdf, other]

    cs.NE cs.AI cs.HC cs.LG

    Attention with Intention for a Neural Network Conversation Model

    Authors: Kaisheng Yao, Geoffrey Zweig, Baolin Peng

    Abstract: In a conversation or a dialogue process, attention and intention play intrinsic roles. This paper proposes a neural network based approach that models the attention and intention processes. It essentially consists of three recurrent networks. The encoder network is a word-level model representing source side sentences. The intention network is a recurrent network that models the dynamics of the in…

    Submitted 5 November, 2015; v1 submitted 29 October, 2015; originally announced October 2015.

  21. arXiv:1506.00196  [pdf, other]

    cs.CL

    Sequence-to-Sequence Neural Net Models for Grapheme-to-Phoneme Conversion

    Authors: Kaisheng Yao, Geoffrey Zweig

    Abstract: Sequence-to-sequence translation methods based on generation with a side-conditioned language model have recently shown promising results in several tasks. In machine translation, models conditioned on source side words have been used to produce target-language text, and in image captioning, models conditioned on images have been used to generate caption text. Past work with this approach has focused…

    Submitted 20 August, 2015; v1 submitted 31 May, 2015; originally announced June 2015.

    Comments: Published in INTERSPEECH 2015, Dresden, Germany

  22. arXiv:1505.01809  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Language Models for Image Captioning: The Quirks and What Works

    Authors: Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, Margaret Mitchell

    Abstract: Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a re…

    Submitted 14 October, 2015; v1 submitted 7 May, 2015; originally announced May 2015.

    Comments: See http://research.microsoft.com/en-us/projects/image_captioning for project information

  23. arXiv:1411.4952  [pdf, other]

    cs.CV cs.CL

    From Captions to Visual Concepts and Back

    Authors: Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, Geoffrey Zweig

    Abstract: This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word det…

    Submitted 14 April, 2015; v1 submitted 18 November, 2014; originally announced November 2014.

    Comments: version corresponding to CVPR15 paper