Showing 1–14 of 14 results for author: Han, K J

  1. arXiv:2405.08317

    cs.CL cs.SD eess.AS

    SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

    Authors: Raghuveer Peri, Sai Muralidhar Jayanthi, Srikanth Ronanki, Anshu Bhatia, Karel Mundnich, Saket Dingliwal, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Srikanth Vishnubhotla, Daniel Garcia-Romero, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

    Abstract: Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remains largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: 9+6 pages, Submitted to ACL 2024

  2. arXiv:2405.08295

    cs.CL cs.SD eess.AS

    SpeechVerse: A Large-scale Generalizable Audio Language Model

    Authors: Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

    Abstract: Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore devel…

    Submitted 31 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

    Comments: Single Column, 13 pages

  3. arXiv:2404.07336

    cs.CV cs.MM eess.AS

    PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores

    Authors: Lucas Goncalves, Prashant Mathur, Chandrashekhar Lavania, Metehan Cekic, Marcello Federico, Kyu J. Han

    Abstract: Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks. However, the growth is not attributed solely to models and benchmarks. Universally accepted evaluation metrics also play an important role in advancing the field. While there are many metrics available to evaluate audio and visual content separately…

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: 24 pages

  4. arXiv:2210.00077

    eess.AS cs.LG

    E-Branchformer: Branchformer with Enhanced merging for speech recognition

    Authors: Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

    Abstract: Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention but they have not managed to match Conformer's performance. The recently introduced Bra…

    Submitted 14 October, 2022; v1 submitted 30 September, 2022; originally announced October 2022.

    Comments: Accepted to SLT 2022

  5. arXiv:2112.07648

    cs.CL cs.LG cs.SD eess.AS

    On the Use of External Data for Spoken Named Entity Recognition

    Authors: Ankita Pasad, Felix Wu, Suwon Shon, Karen Livescu, Kyu J. Han

    Abstract: Spoken language understanding (SLU) tasks involve mapping from speech audio signals to semantic labels. Given the complexity of such tasks, good performance might be expected to require large labeled datasets, which are difficult to collect for each new task and domain. However, recent advances in self-supervised speech representations have made it feasible to consider learning SLU models with lim…

    Submitted 8 July, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

    Comments: Accepted at NAACL 2022. Codebase available at https://github.com/asappresearch/spoken-ner

  6. arXiv:2111.10367

    cs.CL cs.LG cs.SD eess.AS

    SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech

    Authors: Suwon Shon, Ankita Pasad, Felix Wu, Pablo Brusco, Yoav Artzi, Karen Livescu, Kyu J. Han

    Abstract: Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level spoken language understanding tasks, including using end-to-end models, but there are fewer annotated datasets for such tasks. At the same time, rece…

    Submitted 29 July, 2022; v1 submitted 19 November, 2021; originally announced November 2021.

    Comments: Updated preprint for SLUE Benchmark v0.2; Toolkit link https://github.com/asappresearch/slue-toolkit

  7. arXiv:2106.09760

    eess.AS cs.CL cs.SD

    Multi-mode Transformer Transducer with Stochastic Future Context

    Authors: Kwangyoun Kim, Felix Wu, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

    Abstract: Automatic speech recognition (ASR) models make fewer errors when more surrounding speech information is presented as context. Unfortunately, acquiring a larger future context leads to higher latency. There exists an inevitable trade-off between speed and accuracy. Naively, to fit different latency requirements, people have to store multiple models and pick the best one under the constraints. Inste…

    Submitted 17 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021

  8. arXiv:2106.06598

    cs.CL eess.AS

    Leveraging Pre-trained Language Model for Speech Sentiment Analysis

    Authors: Suwon Shon, Pablo Brusco, Jing Pan, Kyu J. Han, Shinji Watanabe

    Abstract: In this paper, we explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis. First, we investigate how useful a pre-trained language model would be in a 2-step pipeline approach employing Automatic Speech Recognition (ASR) and transcripts-based sentiment analysis separately. Second, we propose a pseudo label-based semi-supervised t…

    Submitted 11 June, 2021; originally announced June 2021.

    Comments: To appear in Interspeech 2021

  9. arXiv:2101.09624

    eess.AS cs.CL cs.SD

    A Review of Speaker Diarization: Recent Advances with Deep Learning

    Authors: Tae Jin Park, Naoyuki Kanda, Dimitrios Dimitriadis, Kyu J. Han, Shinji Watanabe, Shrikanth Narayanan

    Abstract: Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity, or in short, a task to identify "who spoke when". In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker adaptive processing. These algorithms also gained their own value as a standalone application o…

    Submitted 26 November, 2021; v1 submitted 23 January, 2021; originally announced January 2021.

    Comments: This article is a preprint version of the article published in Computer Speech & Language, Volume 72, March 2022, 101317

  10. arXiv:2005.10470

    eess.AS cs.CL cs.SD

    Multistream CNN for Robust Acoustic Modeling

    Authors: Kyu J. Han, Jing Pan, Venkata Krishna Naveen Tadala, Tao Ma, Dan Povey

    Abstract: This paper proposes multistream CNN, a novel neural network architecture for robust acoustic modeling in speech recognition tasks. The proposed architecture processes input speech with diverse temporal resolutions by applying different dilation rates to convolutional neural networks across multiple streams to achieve the robustness. The dilation rates are selected from the multiples of a sub-sampl…

    Submitted 25 April, 2021; v1 submitted 21 May, 2020; originally announced May 2020.

    Comments: Accepted to ICASSP 2021

  11. arXiv:2005.10469

    eess.AS cs.CL cs.SD

    ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition

    Authors: Jing Pan, Joshua Shapiro, Jeremy Wohlwend, Kyu J. Han, Tao Lei, Tao Ma

    Abstract: In this paper we present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures, a multistream CNN for acoustic modeling and a self-attentive simple recurrent unit (SRU) for language modeling. In the hybrid ASR framework, the multistream CNN acoustic model processes an input of speech frames in multiple parallel pipelines where each stream has a u…

    Submitted 21 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  12. Speaker Diarization with Lexical Information

    Authors: Tae Jin Park, Kyu J. Han, Jing Huang, Xiaodong He, Bowen Zhou, Panayiotis Georgiou, Shrikanth Narayanan

    Abstract: This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition. We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy. To integrate lexical and acoustic information in a comprehensive…

    Submitted 13 April, 2020; originally announced April 2020.

    Journal ref: Interspeech 2019, 391-395

  13. Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap

    Authors: Tae Jin Park, Kyu J. Han, Manoj Kumar, Shrikanth Narayanan

    Abstract: In this study, we propose a new spectral clustering framework that can auto-tune the parameters of the clustering algorithm in the context of speaker diarization. The proposed framework uses normalized maximum eigengap (NME) values to estimate the number of clusters and the parameters for the threshold of the elements of each row in an affinity matrix during spectral clustering, without the use of…

    Submitted 4 March, 2020; originally announced March 2020.

    Comments: in IEEE Signal Processing Letters, 2020

  14. arXiv:1910.00716

    cs.CL cs.SD eess.AS

    State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions

    Authors: Kyu J. Han, Ramon Prieto, Kaixing Wu, Tao Ma

    Abstract: Self-attention has been a huge success for many downstream tasks in NLP, which led to exploration of applying self-attention to speech problems as well. The efficacy of self-attention in speech applications, however, seems not fully blown yet since it is challenging to handle highly correlated speech frames in the context of self-attention. In this paper we propose a new neural network model archi…

    Submitted 1 October, 2019; originally announced October 2019.

    Comments: Accepted to ASRU 2019