Skip to main content

Showing 1–23 of 23 results for author: Kirchhoff, K

Searching in archive eess. Search in all archives.
.
  1. arXiv:2405.08317  [pdf, other

    cs.CL cs.SD eess.AS

    SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

    Authors: Raghuveer Peri, Sai Muralidhar Jayanthi, Srikanth Ronanki, Anshu Bhatia, Karel Mundnich, Saket Dingliwal, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Srikanth Vishnubhotla, Daniel Garcia-Romero, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

    Abstract: Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remains largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we… ▽ More

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: 9+6 pages, Submitted to ACL 2024

  2. arXiv:2405.08295  [pdf, other

    cs.CL cs.SD eess.AS

    SpeechVerse: A Large-scale Generalizable Audio Language Model

    Authors: Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

    Abstract: Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore devel… ▽ More

    Submitted 31 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

    Comments: Single Column, 13 page

  3. arXiv:2307.00453  [pdf, other

    cs.CL cs.SD eess.AS

    Don't Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters

    Authors: Anshu Bhatia, Sanchit Sinha, Saket Dingliwal, Karthik Gopalakrishnan, Sravan Bodapati, Katrin Kirchhoff

    Abstract: Speech representations learned in a self-supervised fashion from massive unlabeled speech corpora have been adapted successfully toward several downstream tasks. However, such representations may be skewed toward canonical data characteristics of such corpora and perform poorly on atypical, non-native accented speaker populations. With the state-of-the-art HuBERT model as a baseline, we propose an… ▽ More

    Submitted 1 July, 2023; originally announced July 2023.

  4. arXiv:2306.08175  [pdf, other

    eess.AS cs.AI cs.LG cs.SD

    DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer ASR

    Authors: Goeric Huybrechts, Srikanth Ronanki, Xilai Li, Hadis Nosrati, Sravan Bodapati, Katrin Kirchhoff

    Abstract: Conformer-based end-to-end models have become ubiquitous these days and are commonly used in both streaming and non-streaming automatic speech recognition (ASR). Techniques like dual-mode and dynamic chunk training helped unify streaming and non-streaming systems. However, there remains a performance gap between streaming with a full and limited past context. To address this issue, we propose the… ▽ More

    Submitted 1 March, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

  5. arXiv:2305.03837  [pdf, other

    eess.AS cs.LG cs.SD

    Mask The Bias: Improving Domain-Adaptive Generalization of CTC-based ASR with Internal Language Model Estimation

    Authors: Nilaksh Das, Monica Sunkara, Sravan Bodapati, **glun Cai, Devang Kulshreshtha, Jeff Farris, Katrin Kirchhoff

    Abstract: End-to-end ASR models trained on large amount of data tend to be implicitly biased towards language semantics of the training data. Internal language model estimation (ILME) has been proposed to mitigate this bias for autoregressive models such as attention-based encoder-decoder and RNN-T. Typically, ILME is performed by modularizing the acoustic and language components of the model architecture,… ▽ More

    Submitted 5 May, 2023; originally announced May 2023.

    Comments: Accepted to ICASSP 2023

  6. arXiv:2211.13280  [pdf, other

    cs.CL cs.SD eess.AS

    Device Directedness with Contextual Cues for Spoken Dialog Systems

    Authors: Dhanush Bekal, Sundararajan Srinivasan, Sravan Bodapati, Srikanth Ronanki, Katrin Kirchhoff

    Abstract: In this work, we define barge-in verification as a supervised learning task where audio-only information is used to classify user spoken dialogue into true and false barge-ins. Following the success of pre-trained models, we use low-level speech representations from a self-supervised representation learning model for our downstream classification task. Further, we propose a novel technique to infu… ▽ More

    Submitted 23 November, 2022; originally announced November 2022.

  7. arXiv:2210.09510  [pdf, other

    cs.CL cs.SD eess.AS

    Towards Personalization of CTC Speech Recognition Models with Contextual Adapters and Adaptive Boosting

    Authors: Saket Dingliwal, Monica Sunkara, Sravan Bodapati, Srikanth Ronanki, Jeff Farris, Katrin Kirchhoff

    Abstract: End-to-end speech recognition models trained using joint Connectionist Temporal Classification (CTC)-Attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time due to its speed and simplicity. However, such models are hard to personalize because of their conditional independence assumption that prevents output tokens from previ… ▽ More

    Submitted 13 November, 2022; v1 submitted 17 October, 2022; originally announced October 2022.

    Comments: To appear in SLT 2022

  8. arXiv:2205.10643  [pdf, other

    cs.CL cs.SD eess.AS

    Self-Supervised Speech Representation Learning: A Review

    Authors: Abdelrahman Mohamed, Hung-yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

    Abstract: Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a… ▽ More

    Submitted 27 October, 2022; v1 submitted 21 May, 2022; originally announced May 2022.

  9. arXiv:2112.05863  [pdf, other

    eess.AS cs.CL cs.LG cs.SD eess.SP

    Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech

    Authors: Rohit Paturi, Sundararajan Srinivasan, Katrin Kirchhoff, Daniel Garcia-Romero

    Abstract: Many of the recent advances in speech separation are primarily aimed at synthetic mixtures of short audio utterances with high degrees of overlap. Most of these approaches need an additional stitching step to stitch the separated speech chunks for long form audio. Since most of the approaches involve Permutation Invariant training (PIT), the order of separated speech chunks is nondeterministic and… ▽ More

    Submitted 6 September, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

    Comments: Accepted for publication at Interspeech 2022

  10. arXiv:2112.00158  [pdf

    eess.AS

    Representation learning through cross-modal conditional teacher-student training for speech emotion recognition

    Authors: Sundararajan Srinivasan, Zhaocheng Huang, Katrin Kirchhoff

    Abstract: Generic pre-trained speech and text representations promise to reduce the need for large labeled datasets on specific speech and language tasks. However, it is not clear how to effectively adapt these representations for speech emotion recognition. Recent public benchmarks show the efficacy of several popular self-supervised speech representations for emotion classification. In this study, we show… ▽ More

    Submitted 27 January, 2022; v1 submitted 30 November, 2021; originally announced December 2021.

    Comments: Accepted for publication at IEEE ICASSP 2022

  11. arXiv:2109.05092  [pdf, other

    eess.AS cs.SD

    Remember the context! ASR slot error correction through memorization

    Authors: Dhanush Bekal, Ashish Shenoy, Monica Sunkara, Sravan Bodapati, Katrin Kirchhoff

    Abstract: Accurate recognition of slot values such as domain specific words or named entities by automatic speech recognition (ASR) systems forms the core of the Goal-oriented Dialogue Systems. Although it is a critical step with direct impact on downstream tasks such as language understanding, many domain agnostic ASR systems tend to perform poorly on domain specific or long tail words. They are often supp… ▽ More

    Submitted 17 September, 2021; v1 submitted 10 September, 2021; originally announced September 2021.

    Comments: 8 pages, 3 figures, 4 tables, Accepted to ASRU 2021

  12. arXiv:2106.09532  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and Multi-Task Language Modeling

    Authors: Ashish Shenoy, Sravan Bodapati, Katrin Kirchhoff

    Abstract: Automatic Speech Recognition (ASR) robustness toward slot entities are critical in e-commerce voice assistants that involve monetary transactions and purchases. Along with effective domain adaptation, it is intuitive that cross utterance contextual cues play an important role in disambiguating domain specific content words from speech. In this paper, we investigate various techniques to improve co… ▽ More

    Submitted 15 June, 2021; originally announced June 2021.

    Comments: Accepted at ACL-IJCNLP 2021 Workshop on e-Commerce and NLP (ECNLP)

  13. arXiv:2106.05792  [pdf, other

    eess.AS

    Speaker-conversation factorial designs for diarization error analysis

    Authors: Scott Seyfarth, Sundararajan Srinivasan, Katrin Kirchhoff

    Abstract: Speaker diarization accuracy can be affected by both acoustics and conversation characteristics. Determining the cause of diarization errors is difficult because speaker voice acoustics and conversation structure co-vary, and the interactions between acoustics, conversational structure, and diarization accuracy are complex. This paper proposes a methodology that can distinguish independent margina… ▽ More

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: 5 pages, 2 figures, Interspeech 2021

  14. arXiv:2102.06380  [pdf, ps, other

    cs.CL eess.AS

    Neural Inverse Text Normalization

    Authors: Monica Sunkara, Chaitanya Shivade, Sravan Bodapati, Katrin Kirchhoff

    Abstract: While there have been several contributions exploring state of the art techniques for text normalization, the problem of inverse text normalization (ITN) remains relatively unexplored. The best known approaches leverage finite state transducer (FST) based models which rely on manually curated rules and are hence not scalable. We propose an efficient and robust neural solution for ITN leveraging tr… ▽ More

    Submitted 12 February, 2021; originally announced February 2021.

    Comments: 5 pages, accepted to ICASSP 2021

  15. arXiv:2011.15023  [pdf, other

    cs.CL eess.AS

    Transformer-Transducers for Code-Switched Speech Recognition

    Authors: Siddharth Dalmia, Yuzong Liu, Srikanth Ronanki, Katrin Kirchhoff

    Abstract: We live in a world where 60% of the population can speak two or more languages fluently. Members of these communities constantly switch between languages when having a conversation. As automatic speech recognition (ASR) systems are being deployed to the real-world, there is a need for practical systems that can handle multiple languages both within an utterance or across utterances. In this paper,… ▽ More

    Submitted 14 February, 2021; v1 submitted 30 November, 2020; originally announced November 2020.

    Comments: Accepted at ICASSP 2021

  16. arXiv:2010.14233  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment

    Authors: Ethan A. Chi, Julian Salazar, Katrin Kirchhoff

    Abstract: Non-autoregressive models greatly improve decoding speed over typical sequence-to-sequence models, but suffer from degraded performance. Infilling and iterative refinement models make up some of this gap by editing the outputs of a non-autoregressive model, but are constrained in the edits that they can make. We propose iterative realignment, where refinements occur over latent alignments rather t… ▽ More

    Submitted 24 October, 2020; originally announced October 2020.

    ACM Class: I.2.7

  17. arXiv:2008.00702  [pdf, other

    eess.AS cs.CL

    Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech

    Authors: Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, Sravan Bodapati, Katrin Kirchhoff

    Abstract: In this work, we explore a multimodal semi-supervised learning approach for punctuation prediction by learning representations from large amounts of unlabelled audio and text data. Conventional approaches in speech processing typically use forced alignment to encoder per frame acoustic features to word level features and perform multimodal fusion of the resulting acoustic and lexical representatio… ▽ More

    Submitted 3 August, 2020; originally announced August 2020.

    Comments: Accepted for Interspeech 2020

  18. arXiv:2007.02025  [pdf, other

    cs.CL cs.SD eess.AS

    Robust Prediction of Punctuation and Truecasing for Medical ASR

    Authors: Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sravan Bodapati, Katrin Kirchhoff

    Abstract: Automatic speech recognition (ASR) systems in the medical domain that focus on transcribing clinical dictations and doctor-patient conversations often pose many challenges due to the complexity of the domain. ASR output typically undergoes automatic punctuation to enable users to speak naturally, without having to vocalise awkward and explicit punctuation commands, such as "period", "add comma" or… ▽ More

    Submitted 11 July, 2020; v1 submitted 4 July, 2020; originally announced July 2020.

    Comments: Accepted for ACL NLPMC workshop 2020

  19. arXiv:1912.01679  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition

    Authors: Shaoshi Ling, Yuzong Liu, Julian Salazar, Katrin Kirchhoff

    Abstract: We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a s… ▽ More

    Submitted 9 April, 2020; v1 submitted 3 December, 2019; originally announced December 2019.

    Comments: Accepted to ICASSP 2020 (oral)

  20. arXiv:1910.14659  [pdf, other

    cs.CL cs.LG eess.AS stat.ML

    Masked Language Model Scoring

    Authors: Julian Salazar, Davis Liang, Toan Q. Nguyen, Katrin Kirchhoff

    Abstract: Pretrained masked language models (MLMs) require finetuning for most NLP tasks. Instead, we evaluate MLMs out of the box via their pseudo-log-likelihood scores (PLLs), which are computed by masking tokens one by one. We show that PLLs outperform scores from autoregressive language models like GPT-2 in a variety of tasks. By rescoring ASR and NMT hypotheses, RoBERTa reduces an end-to-end LibriSpeec… ▽ More

    Submitted 31 December, 2020; v1 submitted 31 October, 2019; originally announced October 2019.

    Comments: ACL 2020 camera-ready (presented July 2020)

    Journal ref: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020), 2699-2712

  21. arXiv:1907.00457  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    BERTphone: Phonetically-Aware Encoder Representations for Utterance-Level Speaker and Language Recognition

    Authors: Shaoshi Ling, Julian Salazar, Yuzong Liu, Katrin Kirchhoff

    Abstract: We introduce BERTphone, a Transformer encoder trained on large speech corpora that outputs phonetically-aware contextual representation vectors that can be used for both speaker and language recognition. This is accomplished by training on two objectives: the first, inspired by adapting BERT to the continuous domain, involves masking spans of input frames and reconstructing the whole sequence for… ▽ More

    Submitted 29 December, 2021; v1 submitted 30 June, 2019; originally announced July 2019.

    Comments: Odyssey 2020 camera-ready (presented Nov. 2020)

    Journal ref: Proc. the Speaker and Language Recognition Workshop (Odyssey 2020), 9-16

  22. arXiv:1901.10055  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition

    Authors: Julian Salazar, Katrin Kirchhoff, Zhiheng Huang

    Abstract: The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition. Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully self-attentional net… ▽ More

    Submitted 19 February, 2019; v1 submitted 22 January, 2019; originally announced January 2019.

    Comments: Accepted to ICASSP 2019

  23. arXiv:1901.08608  [pdf, other

    cs.SD cs.MM eess.AS

    Multi-stream Network With Temporal Attention For Environmental Sound Classification

    Authors: Xinyu Li, Venkata Chebiyyam, Katrin Kirchhoff

    Abstract: Environmental sound classification systems often do not perform robustly across different sound classification tasks and audio signals of varying temporal structures. We introduce a multi-stream convolutional neural network with temporal attention that addresses these problems. The network relies on three input streams consisting of raw audio and spectral features and utilizes a temporal attention… ▽ More

    Submitted 24 January, 2019; originally announced January 2019.