Showing 1–37 of 37 results for author: Saon, G

  1. arXiv:2402.00235  [pdf, other]

    cs.CL cs.SD eess.AS

    Exploring the limits of decoder-only models trained on public speech recognition corpora

    Authors: Ankit Gupta, George Saon, Brian Kingsbury

    Abstract: The emergence of industrial-scale speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled and 12M hours of audio only proprietary data respectively, has led to a stronger need for large scale public ASR corpora and competitive open source pipelines. Unlike the said models, large language models are typically based on Transformer decoders, and it remains uncl…

    Submitted 31 January, 2024; originally announced February 2024.

  2. arXiv:2311.12727  [pdf, other]

    cs.LG cs.CL

    Soft Random Sampling: A Theoretical and Empirical Analysis

    Authors: Xiaodong Cui, Ashish Mittal, Songtao Lu, Wei Zhang, George Saon, Brian Kingsbury

    Abstract: Soft random sampling (SRS) is a simple yet effective approach for efficient training of large-scale deep neural networks when dealing with massive data. SRS selects a subset uniformly at random with replacement from the full data set in each epoch. In this paper, we conduct a theoretical and empirical analysis of SRS. First, we analyze its sampling dynamics including data coverage and occupancy. N…

    Submitted 23 November, 2023; v1 submitted 21 November, 2023; originally announced November 2023.

  3. arXiv:2309.10926  [pdf, other]

    cs.CL cs.SD eess.AS

    Semi-Autoregressive Streaming ASR With Label Context

    Authors: Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury

    Abstract: Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy. Since NAR automatic speech recognition (ASR) models must wait for the completion of the entire utterance before processing, some works explore streaming NAR models based…

    Submitted 20 February, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  4. arXiv:2309.04031  [pdf, other]

    cs.CL cs.SD eess.AS

    Multiple Representation Transfer from Large Language Models to End-to-End ASR Systems

    Authors: Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Masayasu Muraoka, George Saon

    Abstract: Transferring the knowledge of large language models (LLMs) is a promising technique to incorporate linguistic knowledge into end-to-end automatic speech recognition (ASR) systems. However, existing works only transfer a single representation of LLM (e.g. the last layer of pretrained BERT), while the representation of a text is inherently non-unique and can be obtained variously from different laye…

    Submitted 25 December, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  5. arXiv:2302.14120  [pdf, other]

    eess.AS cs.SD

    Diagonal State Space Augmented Transformers for Speech Recognition

    Authors: George Saon, Ankit Gupta, Xiaodong Cui

    Abstract: We improve on the popular conformer architecture by replacing the depthwise temporal convolutions with diagonal state space (DSS) models. DSS is a recently introduced variant of linear RNNs obtained by discretizing a linear dynamical system with a diagonal state transition matrix. DSS layers project the input sequence onto a space of orthogonal polynomials where the choice of basis functions, metr…

    Submitted 27 February, 2023; originally announced February 2023.

    Comments: to be presented at ICASSP 2023

  6. arXiv:2208.01818  [pdf, other]

    cs.SD cs.CL eess.AS

    VQ-T: RNN Transducers using Vector-Quantized Prediction Network States

    Authors: Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury

    Abstract: Beam search, which is the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses. However, recent studies have shown that decoding with hypothesis merging can achieve a more efficient search with comparable or better performance. But, the full context in recurrent networks is not compatible with hypothesis merging. We propose to use vector-quantized long short-…

    Submitted 2 August, 2022; originally announced August 2022.

    Comments: Interspeech 2022 accepted paper

  7. arXiv:2207.13965  [pdf, other]

    eess.AS cs.SD

    Extending RNN-T-based speech recognition systems with emotion and language classification

    Authors: Zvi Kons, Hagai Aronowitz, Edmilson Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas, George Saon

    Abstract: Speech transcription, emotion recognition, and language identification are usually considered to be three different tasks. Each one requires a different model with a different architecture and training process. We propose using a recurrent neural network transducer (RNN-T)-based speech-to-text (STT) system as a common component that can be used for emotion recognition and language identification a…

    Submitted 28 July, 2022; originally announced July 2022.

    Comments: Accepted for publication in Interspeech 2022

  8. arXiv:2206.07882  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization

    Authors: Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Kailash Gopalakrishnan

    Abstract: We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T). We use a 4 bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model (acoustic encoder and language model) and achieve near-iso-accuracy. We show that customized quantization schemes that are tailo…

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: 5 pages, 2 figures, 1 table. Paper accepted to Interspeech 2022

    ACM Class: I.2.6

  9. arXiv:2204.00212  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Effect and Analysis of Large-scale Language Model Rescoring on Competitive ASR Systems

    Authors: Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Nobuyasu Itoh, George Saon

    Abstract: Large-scale language models (LLMs) such as GPT-2, BERT and RoBERTa have been successfully applied to ASR N-best rescoring. However, whether or how they can benefit competitive, near state-of-the-art ASR systems remains unexplored. In this study, we incorporate LLM rescoring into one of the most competitive ASR baselines: the Conformer-Transducer model. We demonstrate that consistent improvement is…

    Submitted 18 August, 2022; v1 submitted 1 April, 2022; originally announced April 2022.

    Comments: Accepted to Interspeech 2022

  10. arXiv:2203.15176  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Generalization of Deep Neural Network Acoustic Models with Length Perturbation and N-best Based Label Smoothing

    Authors: Xiaodong Cui, George Saon, Tohru Nagano, Masayuki Suzuki, Takashi Fukuda, Brian Kingsbury, Gakuto Kurata

    Abstract: We introduce two techniques, length perturbation and n-best based label smoothing, to improve generalization of deep neural network (DNN) acoustic models for automatic speech recognition (ASR). Length perturbation is a data augmentation algorithm that randomly drops and inserts frames of an utterance to alter the length of the speech feature sequence. N-best based label smoothing randomly injects…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to Interspeech 2022

  11. arXiv:2203.00006  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards Reducing the Need for Speech Training Data To Build Spoken Language Understanding Systems

    Authors: Samuel Thomas, Hong-Kwang J. Kuo, Brian Kingsbury, George Saon

    Abstract: The lack of speech data annotated with labels required for spoken language understanding (SLU) is often a major hurdle in building end-to-end (E2E) systems that can directly process speech inputs. In contrast, large amounts of text data with suitable labels are usually available. In this paper, we propose a novel text representation and training methodology that allows E2E SLU systems to be effect…

    Submitted 26 February, 2022; originally announced March 2022.

    Comments: ©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. arXiv admin note: text overlap with arXiv:2202.13155

  12. arXiv:2202.13155  [pdf, other]

    cs.CL cs.SD eess.AS

    Integrating Text Inputs For Training and Adapting RNN Transducer ASR Models

    Authors: Samuel Thomas, Brian Kingsbury, George Saon, Hong-Kwang J. Kuo

    Abstract: Compared to hybrid automatic speech recognition (ASR) systems that use a modular architecture in which each component can be independently adapted to a new domain, recent end-to-end (E2E) ASR systems are harder to customize due to their all-neural monolithic construction. In this paper, we propose a novel text representation and training framework for E2E ASR models. With this approach, we show tha…

    Submitted 26 February, 2022; originally announced February 2022.

    Comments: ©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  13. arXiv:2201.12105  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Improving End-to-End Models for Set Prediction in Spoken Language Understanding

    Authors: Hong-Kwang J. Kuo, Zoltan Tuske, Samuel Thomas, Brian Kingsbury, George Saon

    Abstract: The goal of spoken language understanding (SLU) systems is to determine the meaning of the input speech signal, unlike speech recognition which aims to produce verbatim transcripts. Advances in end-to-end (E2E) speech modeling have made it possible to train solely on semantic entities, which are far cheaper to collect than verbatim transcripts. We focus on this set prediction problem, where entity…

    Submitted 28 January, 2022; originally announced January 2022.

    Comments: ICASSP ©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    ACM Class: I.2.7

  14. arXiv:2110.11199  [pdf, other]

    cs.CL

    Asynchronous Decentralized Distributed Training of Acoustic Models

    Authors: Xiaodong Cui, Wei Zhang, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung

    Abstract: Large-scale distributed training of deep acoustic models plays an important role in today's high-performance automatic speech recognition (ASR). In this paper we investigate a variety of asynchronous decentralized distributed training strategies based on data parallel stochastic gradient descent (SGD) to show their superior performance over the commonly-used synchronous distributed training via al…

    Submitted 21 October, 2021; originally announced October 2021.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing

  15. arXiv:2110.02743  [pdf, other]

    eess.AS cs.LG cs.NE q-bio.QM

    Towards efficient end-to-end speech recognition with biologically-inspired neural networks

    Authors: Thomas Bohnstingl, Ayush Garg, Stanisław Woźniak, George Saon, Evangelos Eleftheriou, Angeliki Pantazi

    Abstract: Automatic speech recognition (ASR) is a capability which enables a program to process human speech into a written form. Recent developments in artificial intelligence (AI) have led to high-accuracy ASR systems based on deep neural networks, such as the recurrent neural network transducer (RNN-T). However, the core components and the performed operations of these approaches depart from the powerful…

    Submitted 4 November, 2021; v1 submitted 4 October, 2021; originally announced October 2021.

    Comments: Accepted at the Efficient Natural Language and Speech Processing workshop at NeurIPS 2021

  16. arXiv:2108.12265  [pdf]

    cs.CY cs.LG

    Quantum Machine Learning for Health State Diagnosis and Prognostics

    Authors: Gabriel San Martín, Enrique López Droguett

    Abstract: Quantum computing is a new field that has recently attracted researchers from a broad range of fields due to its representation power, flexibility and promising results in both speed and scalability. Since 2020, laboratories around the globe have started to experiment with models that lie in the juxtaposition between machine learning and quantum computing. The availability of quantum processing un…

    Submitted 25 August, 2021; originally announced August 2021.

    Comments: Pre-print for RAMS 2022 Conference

  17. arXiv:2108.12074  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    4-bit Quantization of LSTM-based Speech Recognition Models

    Authors: Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan

    Abstract: We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts). Using a 4-bit integer representation, a naïve quantization approach applied to the LSTM port…

    Submitted 26 August, 2021; originally announced August 2021.

    Comments: 5 pages, 3 figures, Andrea Fasoli and Chia-Yu Chen equally contributed to this work. Paper accepted to Interspeech 2021

    ACM Class: I.2.6

  18. arXiv:2108.10803  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    Reducing Exposure Bias in Training Recurrent Neural Network Transducers

    Authors: Xiaodong Cui, Brian Kingsbury, George Saon, David Haws, Zoltan Tuske

    Abstract: When recurrent neural network transducers (RNNTs) are trained using the typical maximum likelihood criterion, the prediction network is trained only on ground truth label sequences. This leads to a mismatch during inference, known as exposure bias, when the model must deal with label sequences containing errors. In this paper we investigate approaches to reducing exposure bias in training to impro…

    Submitted 24 August, 2021; originally announced August 2021.

    Comments: accepted to Interspeech 2021

  19. arXiv:2108.08405  [pdf, other]

    cs.CL cs.SD eess.AS

    Integrating Dialog History into End-to-End Spoken Language Understanding Systems

    Authors: Jatin Ganhotra, Samuel Thomas, Hong-Kwang J. Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury

    Abstract: End-to-end spoken language understanding (SLU) systems that process human-human or human-computer interactions are often context independent and process each turn of a conversation independently. Spoken conversations on the other hand, are very much context dependent, and dialog history contains useful information that can improve the processing of each conversational turn. In this paper, we inves…

    Submitted 18 August, 2021; originally announced August 2021.

    Comments: Interspeech 2021

  20. arXiv:2105.00982  [pdf, other]

    cs.CL cs.SD eess.AS

    On the limit of English conversational speech recognition

    Authors: Zoltán Tüske, George Saon, Brian Kingsbury

    Abstract: In our previous work we demonstrated that a single headed attention encoder-decoder model is able to reach state-of-the-art results in conversational speech recognition. In this paper, we further improve the results for both Switchboard 300 and 2000. Through use of an improved optimizer, speaker vector embeddings, and alternative speech representations we reduce the recognition errors of our LSTM…

    Submitted 3 May, 2021; originally announced May 2021.

  21. arXiv:2104.03842  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    RNN Transducer Models For Spoken Language Understanding

    Authors: Samuel Thomas, Hong-Kwang J. Kuo, George Saon, Zoltán Tüske, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory

    Abstract: We present a comprehensive study on building and adapting RNN transducer (RNN-T) models for spoken language understanding (SLU). These end-to-end (E2E) models are constructed in three practical settings: a case where verbatim transcripts are available, a constrained case where the only available annotations are SLU labels and their values, and a more restrictive case where transcripts are available…

    Submitted 8 April, 2021; originally announced April 2021.

    Comments: To appear in the proceedings of ICASSP 2021

  22. arXiv:2103.09935  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Advancing RNN Transducer Technology for Speech Recognition

    Authors: George Saon, Zoltan Tueske, Daniel Bolanos, Brian Kingsbury

    Abstract: We investigate a set of techniques for RNN Transducers (RNN-Ts) that were instrumental in lowering the word error rate on three different tasks (Switchboard 300 hours, conversational Spanish 780 hours and conversational Italian 900 hours). The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and general training recipe. First, we introduce a…

    Submitted 17 March, 2021; originally announced March 2021.

    Comments: Accepted at ICASSP 2021

  23. arXiv:2002.10502  [pdf, other]

    cs.DC cs.LG cs.SD eess.AS

    Distributed Training of Deep Neural Network Acoustic Models for Automatic Speech Recognition

    Authors: Xiaodong Cui, Wei Zhang, Ulrich Finkler, George Saon, Michael Picheny, David Kung

    Abstract: The past decade has witnessed great progress in Automatic Speech Recognition (ASR) due to advances in deep learning. The improvements in performance can be attributed to both improved models and large-scale training data. Key to training such models is the employment of efficient distributed learning techniques. In this article, we provide an overview of distributed training techniques for deep ne…

    Submitted 24 February, 2020; originally announced February 2020.

    Comments: Accepted to IEEE Signal Processing Magazine

  24. arXiv:2002.01119  [pdf, other]

    cs.LG cs.DC stat.ML

    Improving Efficiency in Large-Scale Decentralized Distributed Training

    Authors: Wei Zhang, Xiaodong Cui, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, Youssef Mroueh, Alper Buyuktosunoglu, Payel Das, David Kung, Michael Picheny

    Abstract: Decentralized Parallel SGD (D-PSGD) and its asynchronous variant Asynchronous Parallel SGD (AD-PSGD) are a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks. One drawback of (A)D-PSGD is that the spectral gap of the mixing matrix decreases when the number of learners in the system increases, which hampers convergence. In this p…

    Submitted 3 February, 2020; originally announced February 2020.

    Journal ref: 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP'2020) Oral

  25. arXiv:2001.07263  [pdf, other]

    eess.AS cs.CL

    Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard

    Authors: Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury

    Abstract: It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training. In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single headed attention, LSTM based model. Using a cross-u…

    Submitted 19 October, 2020; v1 submitted 20 January, 2020; originally announced January 2020.

    Comments: 5 pages, 2 figures

    MSC Class: 68T10 ACM Class: I.2.7

  26. arXiv:1908.03455  [pdf, other]

    cs.CL cs.SD eess.AS

    Challenging the Boundaries of Speech Recognition: The MALACH Corpus

    Authors: Michael Picheny, Zóltan Tüske, Brian Kingsbury, Kartik Audhkhasi, Xiaodong Cui, George Saon

    Abstract: There has been huge progress in speech recognition over the last several years. Tasks once thought extremely difficult, such as SWITCHBOARD, now approach levels of human performance. The MALACH corpus (LDC catalog LDC2012S05), a 375-Hour subset of a large archive of Holocaust testimonies collected by the Survivors of the Shoah Visual History Foundation, presents significant challenges to the speec…

    Submitted 9 August, 2019; originally announced August 2019.

    Comments: Accepted for publication at INTERSPEECH 2019

  27. arXiv:1907.05701  [pdf, other]

    eess.AS cs.DC cs.LG cs.SD stat.ML

    A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition

    Authors: Wei Zhang, Xiaodong Cui, Ulrich Finkler, George Saon, Abdullah Kayi, Alper Buyuktosunoglu, Brian Kingsbury, David Kung, Michael Picheny

    Abstract: Modern Automatic Speech Recognition (ASR) systems rely on distributed deep learning for quick training completion. To enable efficient distributed training, it is imperative that the training algorithms can converge with a large mini-batch size. In this work, we discovered that Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) can work with much larger batch size than com…

    Submitted 10 July, 2019; originally announced July 2019.

    Journal ref: INTERSPEECH 2019

  28. arXiv:1904.13258  [pdf, other]

    cs.CL cs.SD eess.AS

    English Broadcast News Speech Recognition by Humans and Machines

    Authors: Samuel Thomas, Masayuki Suzuki, Yinghui Huang, Gakuto Kurata, Zoltan Tuske, George Saon, Brian Kingsbury, Michael Picheny, Tom Dibert, Alice Kaiser-Schatzlein, Bern Samko

    Abstract: With recent advances in deep learning, considerable attention has been given to achieving automatic speech recognition performance close to human performance on tasks like conversational telephone speech (CTS) recognition. In this paper we evaluate the usefulness of these proposed techniques on broadcast news (BN), a similarly challenging task. We also perform a set of recognition measurements to un…

    Submitted 30 April, 2019; originally announced April 2019.

    Comments: ©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  29. arXiv:1904.04956  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS stat.ML

    Distributed Deep Learning Strategies For Automatic Speech Recognition

    Authors: Wei Zhang, Xiaodong Cui, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung, Michael Picheny

    Abstract: In this paper, we propose and investigate a variety of distributed deep learning strategies for automatic speech recognition (ASR) and evaluate them with a state-of-the-art Long short-term memory (LSTM) acoustic model on the 2000-hour Switchboard (SWB2000), which is one of the most widely used datasets for ASR performance benchmark. We first investigate what are the proper hyper-parameters (e.g.,…

    Submitted 9 April, 2019; originally announced April 2019.

    Comments: Published in ICASSP'19

  30. arXiv:1712.03133  [pdf, other]

    cs.CL cs.AI cs.NE stat.ML

    Building competitive direct acoustics-to-word models for English conversational speech recognition

    Authors: Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, Michael Picheny

    Abstract: Direct acoustics-to-word (A2W) models in the end-to-end paradigm have received increasing attention compared to conventional sub-word based automatic speech recognition models using phones, characters, or context-dependent hidden Markov model states. This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making train…

    Submitted 8 December, 2017; originally announced December 2017.

    Comments: Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018

  31. arXiv:1710.06937  [pdf, ps, other]

    cs.CL

    Embedding-Based Speaker Adaptive Training of Deep Neural Networks

    Authors: Xiaodong Cui, Vaibhava Goel, George Saon

    Abstract: An embedding-based speaker adaptive training (SAT) approach is proposed and investigated in this paper for deep neural network acoustic modeling. In this approach, speaker embedding vectors, which are a constant given a particular speaker, are mapped through a control network to layer-dependent element-wise affine transformations to canonicalize the internal feature representations at the output o…

    Submitted 17 October, 2017; originally announced October 2017.

  32. arXiv:1709.06436  [pdf, other]

    cs.CL

    Language Modeling with Highway LSTM

    Authors: Gakuto Kurata, Bhuvana Ramabhadran, George Saon, Abhinav Sethy

    Abstract: Language models (LMs) based on Long Short Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks. In this paper, we extend an LSTM by adding highway networks inside an LSTM and use the resulting Highway LSTM (HW-LSTM) model for language modeling. The added highway networks increase the depth in the time dimension. Since a typical LSTM has two internal states, a memory…

    Submitted 19 September, 2017; originally announced September 2017.

    Comments: to appear in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2017)

  33. arXiv:1703.07754  [pdf, other]

    cs.CL cs.NE stat.ML

    Direct Acoustics-to-Word Models for English Conversational Speech Recognition

    Authors: Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, David Nahamoo

    Abstract: Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and separately-trained Language Model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words with…

    Submitted 22 March, 2017; originally announced March 2017.

    Comments: Submitted to Interspeech-2017

  34. arXiv:1703.02136  [pdf, other]

    cs.CL

    English Conversational Telephone Speech Recognition by Humans and Machines

    Authors: George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall

    Abstract: One of the most difficult speech recognition tasks is accurate recognition of human to human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to b…

    Submitted 6 March, 2017; originally announced March 2017.

  35. arXiv:1604.08242  [pdf, other]

    cs.CL

    The IBM 2016 English Conversational Telephone Speech Recognition System

    Authors: George Saon, Tom Sercu, Steven Rennie, Hong-Kwang J. Kuo

    Abstract: We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation testset. On the acoustic side, we use a score fusion of three strong models: recurrent nets with maxout activations, very deep convolutional nets with 3x3 kernels, and bidir…

    Submitted 22 June, 2016; v1 submitted 27 April, 2016; originally announced April 2016.

    Comments: Submitted to Interspeech 2016

  36. arXiv:1505.05899  [pdf, other]

    cs.CL

    The IBM 2015 English Conversational Telephone Speech Recognition System

    Authors: George Saon, Hong-Kwang J. Kuo, Steven Rennie, Michael Picheny

    Abstract: We describe the latest improvements to the IBM English conversational telephone speech recognition system. Some of the techniques that were found beneficial are: maxout networks with annealed dropout rates; networks with a very large number of outputs trained on 2000 hours of data; joint modeling of partially unfolded recurrent neural networks and convolutional nets by combining the bottleneck and…

    Submitted 21 May, 2015; originally announced May 2015.

    Comments: Submitted to Interspeech 2015

  37. arXiv:1309.1501  [pdf, ps, other]

    cs.LG cs.CL cs.NE math.OC stat.ML

    Improvements to deep convolutional neural networks for LVCSR

    Authors: Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, George E. Dahl, George Saon, Hagen Soltau, Tomas Beran, Aleksandr Y. Aravkin, Bhuvana Ramabhadran

    Abstract: Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further imp…

    Submitted 10 December, 2013; v1 submitted 5 September, 2013; originally announced September 2013.

    Comments: 6 pages, 1 figure

    MSC Class: 65K05; 90C15; 90C90