Showing 1–11 of 11 results for author: Kuo, H J

  1. arXiv:2204.05188  [pdf, other]

    cs.CL cs.SD eess.AS

    Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems

    Authors: Vishal Sunder, Eric Fosler-Lussier, Samuel Thomas, Hong-Kwang J. Kuo, Brian Kingsbury

    Abstract: Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) have been primarily due to effective pretraining of speech representations. One such pretraining paradigm is the distillation of semantic knowledge from state-of-the-art text-based models like BERT to speech encoder neural networks. This work is a step towards doing the same in a much more efficient and fine-grained manner whe…

    Submitted 1 July, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: 5 pages, 2 figures

  2. arXiv:2204.05169  [pdf, other]

    cs.CL cs.AI

    Towards End-to-End Integration of Dialog History for Improved Spoken Language Understanding

    Authors: Vishal Sunder, Samuel Thomas, Hong-Kwang J. Kuo, Jatin Ganhotra, Brian Kingsbury, Eric Fosler-Lussier

    Abstract: Dialog history plays an important role in spoken language understanding (SLU) performance in a dialog system. For end-to-end (E2E) SLU, previous work has used dialog history in text form, which makes the model dependent on a cascaded automatic speech recognizer (ASR). This rescinds the benefits of an E2E system which is intended to be compact and robust to ASR errors. In this paper, we propose a h…

    Submitted 11 April, 2022; originally announced April 2022.

    Comments: 5 pages, 1 figure

  3. arXiv:2203.00006  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards Reducing the Need for Speech Training Data To Build Spoken Language Understanding Systems

    Authors: Samuel Thomas, Hong-Kwang J. Kuo, Brian Kingsbury, George Saon

    Abstract: The lack of speech data annotated with labels required for spoken language understanding (SLU) is often a major hurdle in building end-to-end (E2E) systems that can directly process speech inputs. In contrast, large amounts of text data with suitable labels are usually available. In this paper, we propose a novel text representation and training methodology that allows E2E SLU systems to be effect…

    Submitted 26 February, 2022; originally announced March 2022.

    Comments: ©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. arXiv admin note: text overlap with arXiv:2202.13155

  4. arXiv:2202.13155  [pdf, other]

    cs.CL cs.SD eess.AS

    Integrating Text Inputs For Training and Adapting RNN Transducer ASR Models

    Authors: Samuel Thomas, Brian Kingsbury, George Saon, Hong-Kwang J. Kuo

    Abstract: Compared to hybrid automatic speech recognition (ASR) systems that use a modular architecture in which each component can be independently adapted to a new domain, recent end-to-end (E2E) ASR systems are harder to customize due to their all-neural monolithic construction. In this paper, we propose a novel text representation and training framework for E2E ASR models. With this approach, we show tha…

    Submitted 26 February, 2022; originally announced February 2022.

    Comments: ©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  5. arXiv:2201.12105  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Improving End-to-End Models for Set Prediction in Spoken Language Understanding

    Authors: Hong-Kwang J. Kuo, Zoltan Tuske, Samuel Thomas, Brian Kingsbury, George Saon

    Abstract: The goal of spoken language understanding (SLU) systems is to determine the meaning of the input speech signal, unlike speech recognition which aims to produce verbatim transcripts. Advances in end-to-end (E2E) speech modeling have made it possible to train solely on semantic entities, which are far cheaper to collect than verbatim transcripts. We focus on this set prediction problem, where entity…

    Submitted 28 January, 2022; originally announced January 2022.

    Comments: ICASSP ©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    ACM Class: I.2.7

  6. arXiv:2108.08405  [pdf, other]

    cs.CL cs.SD eess.AS

    Integrating Dialog History into End-to-End Spoken Language Understanding Systems

    Authors: Jatin Ganhotra, Samuel Thomas, Hong-Kwang J. Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury

    Abstract: End-to-end spoken language understanding (SLU) systems that process human-human or human-computer interactions are often context independent and process each turn of a conversation independently. Spoken conversations, on the other hand, are very much context dependent, and dialog history contains useful information that can improve the processing of each conversational turn. In this paper, we inves…

    Submitted 18 August, 2021; originally announced August 2021.

    Comments: Interspeech 2021

  7. arXiv:2104.03842  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    RNN Transducer Models For Spoken Language Understanding

    Authors: Samuel Thomas, Hong-Kwang J. Kuo, George Saon, Zoltán Tüske, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory

    Abstract: We present a comprehensive study on building and adapting RNN transducer (RNN-T) models for spoken language understanding (SLU). These end-to-end (E2E) models are constructed in three practical settings: a case where verbatim transcripts are available, a constrained case where the only available annotations are SLU labels and their values, and a more restrictive case where transcripts are available…

    Submitted 8 April, 2021; originally announced April 2021.

    Comments: To appear in the proceedings of ICASSP 2021

  8. arXiv:2011.08238  [pdf]

    cs.CL cs.SD eess.AS

    End-to-end spoken language understanding using transformer networks and self-supervised pre-trained features

    Authors: Edmilson Morais, Hong-Kwang J. Kuo, Samuel Thomas, Zoltan Tuske, Brian Kingsbury

    Abstract: Transformer networks and self-supervised pre-training have consistently delivered state-of-the-art results in the field of natural language processing (NLP); however, their merits in the field of spoken language understanding (SLU) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SLU transformer network based architecture which allows the use of self-supervised p…

    Submitted 16 November, 2020; originally announced November 2020.

    Comments: 5 pages, 3 tables and 1 figure

  9. arXiv:2009.14386  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    End-to-End Spoken Language Understanding Without Full Transcripts

    Authors: Hong-Kwang J. Kuo, Zoltán Tüske, Samuel Thomas, Yinghui Huang, Kartik Audhkhasi, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, Luis Lastras

    Abstract: An essential component of spoken language understanding (SLU) is slot filling: representing the meaning of a spoken utterance using semantic entity labels. In this paper, we develop end-to-end (E2E) spoken language understanding systems that directly convert speech input to semantic entities and investigate if these E2E SLU models can be trained solely on semantic entity annotations without word-f…

    Submitted 29 September, 2020; originally announced September 2020.

    Comments: 5 pages, to be published in Interspeech 2020

    ACM Class: I.2.7

  10. arXiv:1604.08242  [pdf, other]

    cs.CL

    The IBM 2016 English Conversational Telephone Speech Recognition System

    Authors: George Saon, Tom Sercu, Steven Rennie, Hong-Kwang J. Kuo

    Abstract: We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation testset. On the acoustic side, we use a score fusion of three strong models: recurrent nets with maxout activations, very deep convolutional nets with 3x3 kernels, and bidir…

    Submitted 22 June, 2016; v1 submitted 27 April, 2016; originally announced April 2016.

    Comments: Submitted to Interspeech 2016

  11. arXiv:1505.05899  [pdf, other]

    cs.CL

    The IBM 2015 English Conversational Telephone Speech Recognition System

    Authors: George Saon, Hong-Kwang J. Kuo, Steven Rennie, Michael Picheny

    Abstract: We describe the latest improvements to the IBM English conversational telephone speech recognition system. Some of the techniques that were found beneficial are: maxout networks with annealed dropout rates; networks with a very large number of outputs trained on 2000 hours of data; joint modeling of partially unfolded recurrent neural networks and convolutional nets by combining the bottleneck and…

    Submitted 21 May, 2015; originally announced May 2015.

    Comments: Submitted to Interspeech 2015