Showing 1–39 of 39 results for author: Stolcke, A

Searching in archive eess.
  1. arXiv:2401.14717  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

    Authors: Jinhan Wang, Long Chen, Aparna Khare, Anirudh Raju, Pranav Dheram, Di He, Minhua Wu, Andreas Stolcke, Venkatesh Ravichandran

    Abstract: We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms the baseline models with single modality. We also develop a novel multi-task instruction fine-tuning…

    Submitted 26 January, 2024; originally announced January 2024.

    Comments: To appear in IEEE ICASSP 2024

  2. Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models

    Authors: Chenyang Gao, Brecht Desplanques, Chelsea J. -T. Ju, Aman Chadha, Andreas Stolcke

    Abstract: Automated speaker identification (SID) is a crucial step for the personalization of a wide range of speech-enabled services. Typical SID systems use a symmetric enrollment-verification framework with a single model to derive embeddings both offline for voice profiles extracted from enrollment utterances, and online from runtime utterances. Due to the distinct circumstances of enrollment and runtim…

    Submitted 22 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  3. arXiv:2401.10447  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE cs.SD eess.AS

    Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition

    Authors: Yu Yu, Chao-Han Huck Yang, Tuan Dinh, Sungho Ryu, Jari Kolehmainen, Roger Ren, Denis Filimonov, Prashanth G. Shivakumar, Ankur Gandhe, Ariya Rastrow, Jia Xu, Ivan Bulyko, Andreas Stolcke

    Abstract: The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasingly popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50% on the public Librispeech dat…

    Submitted 18 January, 2024; originally announced January 2024.

  4. arXiv:2401.02921  [pdf, other]

    cs.CL eess.AS

    Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks

    Authors: Kevin Everson, Yile Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-yi Lee, Ariya Rastrow, Andreas Stolcke

    Abstract: In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text. In real-world scenarios, prior to input into an LLM, an automated speech recognition (ASR) system generates an output transcript hypothesis, where inherent errors ca…

    Submitted 5 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  5. arXiv:2312.15316  [pdf, other]

    cs.CL eess.AS

    Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

    Authors: Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko

    Abstract: Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore pro…

    Submitted 17 January, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024. Camera-ready version

  6. arXiv:2309.15649  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting

    Authors: Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bulyko, Andreas Stolcke

    Abstract: We explore the ability of large language models (LLMs) to act as speech recognition post-processors that perform rescoring and error correction. Our first focus is on instruction prompting to let LLMs perform these tasks without fine-tuning, for which we evaluate different prompting schemes, both zero- and few-shot in-context learning, and a novel task activation prompting method that combines caus…

    Submitted 10 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE Automatic Speech Recognition and Understanding (ASRU) 2023. 8 pages. 2nd version revised from Sep 29th's version

    Journal ref: Proc. IEEE ASRU Workshop, Dec. 2023

  7. arXiv:2309.15223  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE cs.SD eess.AS

    Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition

    Authors: Yu Yu, Chao-Han Huck Yang, Jari Kolehmainen, Prashanth G. Shivakumar, Yile Gu, Sungho Ryu, Roger Ren, Qi Luo, Aditya Gourav, I-Fan Chen, Yi-Chieh Liu, Tuan Dinh, Ankur Gandhe, Denis Filimonov, Shalini Ghosh, Andreas Stolcke, Ariya Rastrow, Ivan Bulyko

    Abstract: We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limit their practical use in rescoring. Here we p…

    Submitted 10 October, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE ASRU 2023. Internal Review Approved. Revised 2nd version with Andreas and Huck. The first version is in Sep 29th. 8 pages

    Journal ref: Proc. IEEE ASRU Workshop, Dec. 2023

  8. Learning When to Trust Which Teacher for Weakly Supervised ASR

    Authors: Aakriti Agrawal, Milind Rao, Anit Kumar Sahu, Gopinath Chennupati, Andreas Stolcke

    Abstract: Automatic speech recognition (ASR) training can utilize multiple experts as teacher models, each trained on a specific domain or accent. Teacher models may be opaque in nature since their architecture may not be known or their training cadence is different from that of the student ASR model. Still, the student models are updated incrementally using the pseudo-labels generated independently by t…

    Submitted 21 June, 2023; originally announced June 2023.

    Comments: Proceedings of INTERSPEECH 2023

    Journal ref: Proc. Interspeech, Aug. 2023, pp. 381-385

  9. Streaming Speech-to-Confusion Network Speech Recognition

    Authors: Denis Filimonov, Prabhat Pandey, Ariya Rastrow, Ankur Gandhe, Andreas Stolcke

    Abstract: In interactive automatic speech recognition (ASR) systems, low-latency requirements limit the amount of search space that can be explored during decoding, particularly in end-to-end neural ASR. In this paper, we present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency, as needed for interactive applications. We show that 1-best results of our mo…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: Submitted to Interspeech 2023

    Journal ref: Proc. Interspeech, Aug. 2023, pp. 4099-4103

  10. PROCTER: PROnunciation-aware ConTextual adaptER for personalized speech recognition in neural transducers

    Authors: Rahul Pandey, Roger Ren, Qi Luo, Jing Liu, Ariya Rastrow, Ankur Gandhe, Denis Filimonov, Grant Strimel, Andreas Stolcke, Ivan Bulyko

    Abstract: End-to-End (E2E) automatic speech recognition (ASR) systems used in voice assistants often have difficulties recognizing infrequent words personalized to the user, such as names and places. Rare words often have non-trivial pronunciations, and in such cases, human knowledge in the form of a pronunciation lexicon can be useful. We propose a PROnunCiation-aware conTextual adaptER (PROCTER) that dyna…

    Submitted 29 March, 2023; originally announced March 2023.

    Comments: To appear in Proc. IEEE ICASSP

    Journal ref: Proc. IEEE ICASSP, June 2023

  11. arXiv:2303.15132  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Cross-utterance ASR Rescoring with Graph-based Label Propagation

    Authors: Srinath Tankasala, Long Chen, Andreas Stolcke, Anirudh Raju, Qianli Deng, Chander Chandak, Aparna Khare, Roland Maas, Venkatesh Ravichandran

    Abstract: We propose a novel approach for ASR N-best hypothesis rescoring with graph-based label propagation by leveraging cross-utterance acoustic similarity. In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information and conducts the rescoring collaboratively among utterances, instead of individually. Experiments on the VCTK da…

    Submitted 27 March, 2023; originally announced March 2023.

    Comments: To appear in IEEE ICASSP 2023

    Journal ref: Proc. IEEE ICASSP, June 2023

  12. Adaptive Endpointing with Deep Contextual Multi-armed Bandits

    Authors: Do June Min, Andreas Stolcke, Anirudh Raju, Colin Vaz, Di He, Venkatesh Ravichandran, Viet Anh Trinh

    Abstract: Current endpointing (EP) solutions learn in a supervised framework, which does not allow the model to incorporate feedback and improve in an online setting. Also, it is a common practice to utilize costly grid-search to find the best configuration for an endpointing model. In this paper, we aim to provide a solution for adaptive endpointing by proposing an efficient method for choosing an optimal…

    Submitted 23 March, 2023; originally announced March 2023.

    Journal ref: Proc. IEEE ICASSP, June 2023

  13. arXiv:2211.09731  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech

    Authors: Xin Zhang, Iván Vallés-Pérez, Andreas Stolcke, Chengzhu Yu, Jasha Droppo, Olabanji Shonibare, Roberto Barra-Chicote, Venkatesh Ravichandran

    Abstract: Stuttering is a speech disorder where the natural flow of speech is interrupted by blocks, repetitions or prolongations of syllables, words and phrases. The majority of existing automatic speech recognition (ASR) interfaces perform poorly on utterances with stutter, mainly due to lack of matched training data. Synthesis of speech with stutter thus presents an opportunity to improve ASR for this ty…

    Submitted 4 November, 2022; originally announced November 2022.

    Comments: 8 pages, 3 figures, 2 tables

    Journal ref: NeurIPS Workshop on SyntheticData4ML, December 2022

  14. arXiv:2210.05614  [pdf, other]

    cs.SD cs.LG cs.NE eess.AS

    An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition

    Authors: Chao-Han Huck Yang, I-Fan Chen, Andreas Stolcke, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: Differential privacy (DP) is one data protection avenue to safeguard user information used for training deep models by imposing noisy distortion on privacy data. Such a noise perturbation often results in a severe performance degradation in automatic speech recognition (ASR) in order to meet a privacy budget $\varepsilon$. Private aggregation of teacher ensemble (PATE) utilizes ensemble probabilit…

    Submitted 13 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: 5 pages. Accepted to IEEE SLT 2022. A first version draft was finished in Aug 2021

  15. Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities

    Authors: Pranav Dheram, Murugesan Ramakrishnan, Anirudh Raju, I-Fan Chen, Brian King, Katherine Powell, Melissa Saboowala, Karan Shetty, Andreas Stolcke

    Abstract: As for other forms of AI, speech recognition has recently been examined with respect to performance disparities across different user cohorts. One approach to achieve fairness in speech recognition is to (1) identify speaker cohorts that suffer from subpar performance and (2) apply fairness mitigation measures targeting the cohorts discovered. In this paper, we report on initial findings with both…

    Submitted 22 July, 2022; originally announced July 2022.

    Comments: Proc. Interspeech 2022

    Journal ref: Proc. Interspeech, Sept. 2022, pp. 1268-1272

  16. Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation

    Authors: Viet Anh Trinh, Pegah Ghahremani, Brian King, Jasha Droppo, Andreas Stolcke, Roland Maas

    Abstract: We present an approach to reduce the performance disparity between geographic regions without degrading performance on the overall user population for ASR. A popular approach is to fine-tune the model with data from regions where the ASR model has a higher word error rate (WER). However, when the ASR model is adapted to get better performance on these high-WER regions, its parameters wander from t…

    Submitted 16 July, 2022; originally announced July 2022.

    Comments: Accepted for publication at Interspeech 2022

    Journal ref: Proc. Interspeech, Sept. 2022, pp. 1298-1302

  17. Adversarial Reweighting for Speaker Verification Fairness

    Authors: Minho Jin, Chelsea J. -T. Ju, Zeya Chen, Yi-Chieh Liu, Jasha Droppo, Andreas Stolcke

    Abstract: We address performance fairness for speaker verification using the adversarial reweighting (ARW) method. ARW is reformulated for speaker verification with metric learning, and shown to improve results across different subgroups of gender and nationality, without requiring annotation of subgroups in the training data. An adversarial network learns a weight for each training sample in the batch so t…

    Submitted 15 July, 2022; originally announced July 2022.

    Journal ref: Proc. Interspeech, Sept. 2022, pp. 4800-4804

  18. arXiv:2207.04081  [pdf]

    eess.AS cs.CL cs.LG cs.SD eess.IV

    Graph-based Multi-View Fusion and Local Adaptation: Mitigating Within-Household Confusability for Speaker Identification

    Authors: Long Chen, Yixiong Meng, Venkatesh Ravichandran, Andreas Stolcke

    Abstract: Speaker identification (SID) in the household scenario (e.g., for smart speakers) is an important but challenging problem due to the limited number of labeled (enrollment) utterances, confusable voices, and demographic imbalances. Conventional speaker recognition systems generalize from a large random sample of speakers, causing the recognition to underperform for households drawn from specific cohort…

    Submitted 8 July, 2022; originally announced July 2022.

    Comments: To appear in Interspeech 2022. arXiv admin note: text overlap with arXiv:2106.08207

    Journal ref: Proc. Interspeech, Sept. 2022, pp. 4805-4809

  19. openFEAT: Improving Speaker Identification by Open-set Few-shot Embedding Adaptation with Transformer

    Authors: Kishan K C, Zhenning Tan, Long Chen, Minho Jin, Eunjung Han, Andreas Stolcke, Chul Lee

    Abstract: Household speaker identification with few enrollment utterances is an important yet challenging problem, especially when household members share similar voice characteristics and room acoustics. A common embedding space learned from a large number of speakers is not universally applicable for the optimal identification of every speaker in a household. In this work, we first formulate household spe…

    Submitted 24 February, 2022; originally announced February 2022.

    Comments: To appear in Proc. IEEE ICASSP 2022

    Journal ref: Proc. IEEE ICASSP, May 2022, pp. 7062-7066

  20. Improving fairness in speaker verification via Group-adapted Fusion Network

    Authors: Hua Shen, Yuguang Yang, Guoli Sun, Ryan Langman, Eunjung Han, Jasha Droppo, Andreas Stolcke

    Abstract: Modern speaker verification models use deep neural networks to encode utterance audio into discriminative embedding vectors. During the training process, these networks are typically optimized to differentiate arbitrary speakers. This learning process biases the learning of fine voice characteristics towards dominant demographic groups, which can lead to an unfair performance disparity across diff…

    Submitted 23 February, 2022; originally announced February 2022.

    Comments: To appear in Proc. IEEE ICASSP 2022

    Journal ref: Proc. IEEE ICASSP, May 2022, pp. 7077-7081

  21. Contrastive-mixup learning for improved speaker verification

    Authors: Xin Zhang, Minho Jin, Roger Cheng, Ruirui Li, Eunjung Han, Andreas Stolcke

    Abstract: This paper proposes a novel formulation of prototypical loss with mixup for speaker verification. Mixup is a simple yet efficient data augmentation technique that fabricates a weighted combination of random data point and label pairs for deep neural network training. Mixup has attracted increasing attention due to its ability to improve robustness and generalization of deep neural networks. Althou…

    Submitted 22 February, 2022; originally announced February 2022.

    Journal ref: Proc. IEEE ICASSP, May 2022, pp. 7652-7656

  22. arXiv:2202.08532  [pdf, other]

    eess.AS cs.AI cs.LG cs.NE cs.SD

    Mitigating Closed-model Adversarial Examples with Bayesian Neural Modeling for Enhanced End-to-End Speech Recognition

    Authors: Chao-Han Huck Yang, Zeeshan Ahmed, Yile Gu, Joseph Szurley, Roger Ren, Linda Liu, Andreas Stolcke, Ivan Bulyko

    Abstract: In this work, we aim to enhance the system robustness of end-to-end automatic speech recognition (ASR) against adversarially-noisy speech examples. We focus on a rigorous and empirical "closed-model adversarial robustness" setting (e.g., on-device or cloud applications). The adversarial noise is only generated by closed-model optimization (e.g., evolutionary and zeroth-order estimation) without ac…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022

  23. ASR-Aware End-to-end Neural Diarization

    Authors: Aparna Khare, Eunjung Han, Yuguang Yang, Andreas Stolcke

    Abstract: We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model. Two categories of features are explored: features derived directly from ASR output (phones, position-in-word and word boundaries) and features derived from a lexical speaker change detection model, trained by fine-tuning a pret…

    Submitted 2 February, 2022; originally announced February 2022.

    Comments: To appear in ICASSP 2022

    Journal ref: Proc. IEEE ICASSP, May 2022, pp. 8092-8096

  24. arXiv:2202.01094  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    RescoreBERT: Discriminative Speech Recognition Rescoring with BERT

    Authors: Liyan Xu, Yile Gu, Jari Kolehmainen, Haidar Khan, Ankur Gandhe, Ariya Rastrow, Andreas Stolcke, Ivan Bulyko

    Abstract: Second-pass rescoring is an important component in automatic speech recognition (ASR) systems that is used to improve the outputs from a first-pass decoder by implementing a lattice rescoring or $n$-best re-ranking. While pretraining with a masked language model (MLM) objective has received great success in various natural language understanding (NLU) tasks, it has not gained traction as a rescori…

    Submitted 18 February, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022

    Journal ref: Proc. IEEE ICASSP, May 2022, pp. 6117-6121

  25. Improving Speaker Identification for Shared Devices by Adapting Embeddings to Speaker Subsets

    Authors: Zhenning Tan, Yuguang Yang, Eunjung Han, Andreas Stolcke

    Abstract: Speaker identification typically involves three stages. First, a front-end speaker embedding model is trained to embed utterance and speaker profiles. Second, a scoring function is applied between a runtime utterance and each speaker profile. Finally, the speaker is identified using nearest neighbor according to the scoring metric. To better distinguish speakers sharing a device within the same ho…

    Submitted 6 September, 2021; originally announced September 2021.

    Comments: Submitted to ASRU 2021

    Journal ref: Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2021, pp. 1124-1131

  26. arXiv:2106.10169  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition

    Authors: Ruirui Li, Chelsea J. -T. Ju, Zeya Chen, Hongda Mao, Oguz Elibol, Andreas Stolcke

    Abstract: By implicitly recognizing a user based on his/her speech input, speaker identification enables many downstream applications, such as personalized system behavior and expedited shopping checkouts. Based on whether the speech content is constrained or not, both text-dependent (TD) and text-independent (TI) speaker recognition models may be used. We wish to combine the advantages of both types of mod…

    Submitted 18 June, 2021; originally announced June 2021.

  27. End-to-end Neural Diarization: From Transformer to Conformer

    Authors: Yi Chieh Liu, Eunjung Han, Chul Lee, Andreas Stolcke

    Abstract: We propose a new end-to-end neural diarization (EEND) system that is based on Conformer, a recently proposed neural architecture that combines convolutional mappings and Transformer to model both local and global dependencies in speech. We first show that data augmentation and convolutional subsampling layers enhance the original self-attentive EEND in the Transformer-based EEND, and then Conforme…

    Submitted 14 June, 2021; originally announced June 2021.

    Comments: To appear in Interspeech 2021

    Journal ref: Proc. Interspeech, Sept. 2021, pp. 3081-3085

  28. Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End

    Authors: Swayambhu Nath Ray, Minhua Wu, Anirudh Raju, Pegah Ghahremani, Raghavendra Bilgi, Milind Rao, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Jasha Droppo

    Abstract: Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to improve a recurrent neural network-transducer (RNN-T) based automatic speech recognition (ASR) system. An audio-to-intent (A2I) model encodes the intent…

    Submitted 16 June, 2021; v1 submitted 14 May, 2021; originally announced May 2021.

    Comments: To appear in Interspeech 2021

    Journal ref: Proc. Interspeech, Sept. 2021, pp. 3455-3459

  29. arXiv:2103.08393  [pdf, other]

    eess.AS cs.LG cs.SD

    Wav2vec-C: A Self-supervised Model for Speech Representation Learning

    Authors: Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas

    Abstract: Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way similar to Wav2vec 2.0. However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features to th…

    Submitted 23 June, 2021; v1 submitted 9 March, 2021; originally announced March 2021.

    Comments: To appear in Interspeech 2021

  30. arXiv:2102.06750  [pdf, other]

    cs.CL eess.AS

    Do as I mean, not as I say: Sequence Loss Training for Spoken Language Understanding

    Authors: Milind Rao, Pranav Dheram, Gautam Tiwari, Anirudh Raju, Jasha Droppo, Ariya Rastrow, Andreas Stolcke

    Abstract: Spoken language understanding (SLU) systems extract transcriptions, as well as semantics of intent or named entities from speech, and are essential components of voice activated systems. SLU models, which either directly extract semantics from audio or are composed of pipelined automatic speech recognition (ASR) and natural language understanding (NLU) models, are typically trained via differentia…

    Submitted 12 February, 2021; originally announced February 2021.

    Comments: Proc. IEEE ICASSP 2021

  31. arXiv:2102.06357  [pdf, other]

    cs.SD cs.LG eess.AS

    Contrastive Unsupervised Learning for Speech Emotion Recognition

    Authors: Mao Li, Bo Yang, Joshua Levy, Andreas Stolcke, Viktor Rozgic, Spyros Matsoukas, Constantinos Papayiannis, Daniel Bone, Chao Wang

    Abstract: Speech emotion recognition (SER) is a key technology to enable more natural human-machine communication. However, SER has long suffered from a lack of public large-scale labeled datasets. To circumvent this problem, we investigate how unsupervised representation learning on unlabeled datasets can benefit SER. We show that the contrastive predictive coding (CPC) method can learn salient representat…

    Submitted 12 February, 2021; originally announced February 2021.

  32. arXiv:2012.07353  [pdf, other]

    eess.AS cs.AI cs.SD

    REDAT: Accent-Invariant Representation for End-to-End ASR by Domain Adversarial Training with Relabeling

    Authors: Hu Hu, Xuesong Yang, Zeynab Raeesy, Jinxi Guo, Gokce Keskin, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Roland Maas

    Abstract: Accents mismatching is a critical problem for end-to-end ASR. This paper aims to address this problem by building an accent-robust RNN-T system with domain adversarial training (DAT). We unveil the magic behind DAT and provide, for the first time, a theoretical guarantee that DAT learns accent-invariant representations. We also prove that performing the gradient reversal in DAT is equivalent to mi…

    Submitted 12 February, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

    Comments: accepted in ICASSP 2021; final camera-ready version

  33. BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers

    Authors: Eunjung Han, Chul Lee, Andreas Stolcke

    Abstract: We present a novel online end-to-end neural diarization system, BW-EDA-EEND, that processes data incrementally for a variable number of speakers. The system is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al., but utilizes the incremental Transformer encoder, attending only to its left contexts and using block-level recurrence in the hidden states to carry information…

    Submitted 12 February, 2021; v1 submitted 5 November, 2020; originally announced November 2020.

    Journal ref: Proc. IEEE ICASSP, June 2021, pp. 7193-7197

  34. arXiv:2011.01997  [pdf, other]

    eess.AS cs.SD

    DOVER-Lap: A Method for Combining Overlap-aware Diarization Outputs

    Authors: Desh Raj, Leibny Paola Garcia-Perera, Zili Huang, Shinji Watanabe, Daniel Povey, Andreas Stolcke, Sanjeev Khudanpur

    Abstract: Several advances have been made recently towards handling overlapping speech for speaker diarization. Since speech and natural language tasks often benefit from ensemble techniques, we propose an algorithm for combining outputs from such diarization systems through majority voting. Our method, DOVER-Lap, is inspired from the recently proposed DOVER algorithm, but is designed to handle overlapping…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted to IEEE SLT 2021

  35. arXiv:2007.13802  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Efficient minimum word error rate training of RNN-Transducer for end-to-end speech recognition

    Authors: Jinxi Guo, Gautam Tiwari, Jasha Droppo, Maarten Van Segbroeck, Che-Wei Huang, Andreas Stolcke, Roland Maas

    Abstract: In this work, we propose a novel and efficient minimum word error rate (MWER) training method for RNN-Transducer (RNN-T). Unlike previous work on this topic, which performs on-the-fly limited-size beam-search decoding and generates alignment scores for expected edit-distance computation, in our proposed method, we re-calculate and sum scores of all the possible alignments for each hypothesis in N-…

    Submitted 27 July, 2020; originally announced July 2020.

    Comments: Accepted to Interspeech 2020

  36. arXiv:1910.11691  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Diarization Robustness using Diversification, Randomization and the DOVER Algorithm

    Authors: Andreas Stolcke

    Abstract: Speaker diarization based on bottom-up clustering of speech segments by acoustic similarity is often highly sensitive to the choice of hyperparameters, such as the initial number of clusters and feature weighting. Optimizing these hyperparameters is difficult and often not robust across different data sets. We recently proposed the DOVER algorithm for combining multiple diarization hypotheses by v…

    Submitted 9 April, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

    Comments: Revised and expanded. To appear in Proc. Odyssey Speaker and Language Recognition Workshop. arXiv admin note: text overlap with arXiv:1909.08090

    Journal ref: Proc. Odyssey Speaker and Language Recognition Workshop, May 2020, pp. 95-101

  37. arXiv:1905.02545  [pdf, other]

    eess.AS cs.CL cs.SD

    Meeting Transcription Using Virtual Microphone Arrays

    Authors: Takuya Yoshioka, Zhuo Chen, Dimitrios Dimitriadis, William Hinthorn, Xuedong Huang, Andreas Stolcke, Michael Zeng

    Abstract: We describe a system that generates speaker-annotated transcripts of meetings by using a virtual microphone array, a set of spatially distributed asynchronous recording devices such as laptops and mobile phones. The system is composed of continuous audio stream alignment, blind beamforming, speech recognition, speaker diarization using prior speaker information, and system combination. When utiliz…

    Submitted 7 July, 2019; v1 submitted 3 May, 2019; originally announced May 2019.

    Report number: MSR-TR-2019-11

  38. arXiv:1610.05256  [pdf, other]

    cs.CL eess.AS

    Achieving Human Parity in Conversational Speech Recognition

    Authors: W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig

    Abstract: Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached human parity. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which new…

    Submitted 17 February, 2017; v1 submitted 17 October, 2016; originally announced October 2016.

    Comments: Revised for publication, updated results

    Report number: MSR-TR-2016-71, revised Feb. 2017

  39. The Microsoft 2016 Conversational Speech Recognition System

    Authors: W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig

    Abstract: We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training…

    Submitted 25 January, 2017; v1 submitted 12 September, 2016; originally announced September 2016.

    Journal ref: Proc. IEEE ICASSP, March 2017, pp. 5255-5259