Showing 1–50 of 55 results for author: Yang, C H

  1. arXiv:2405.14161  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Chengwei Qin, Pin-Yu Chen, Eng Siong Chng, Chao Zhang

    Abstract: We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architecture with auto-regressive decoding (e.g., Whisper, Canary). Specifica…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 23 pages, Preprint

  2. arXiv:2405.06573  [pdf, other]

    cs.SD cs.AI eess.AS

    An Investigation of Incorporating Mamba for Speech Enhancement

    Authors: Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao

    Abstract: This work aims to study a scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. We exploit a Mamba-based regression model to characterize speech signals and build an SE system upon Mamba, termed SEMamba. We explore the properties of Mamba by integrating it as the core model in both basic and advanced SE systems, along with utilizing signal-level distances as well as metric…

    Submitted 10 May, 2024; originally announced May 2024.

  3. arXiv:2404.14716  [pdf, other]

    cs.CL cs.AI cs.CV cs.SD eess.AS

    Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities

    Authors: Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang

    Abstract: Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in dialogue history without any model parameter update. Despite such convenience, the performance of ICL heavily depends on the quality of the in-context examples presented, which makes the in-context example selection approach a critical choice. This paper proposes a novel Bayes…

    Submitted 16 June, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: 17 pages, 6 figures

  4. arXiv:2402.06894  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, Eng Siong Chng

    Abstract: Recent advances in large language models (LLMs) have pushed forward the development of multilingual speech and machine translation through reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the divers…

    Submitted 16 May, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

    Comments: 18 pages, Accepted by ACL 2024. This work is open sourced at: https://github.com/YUCHEN005/GenTranslate

  5. arXiv:2402.05457  [pdf, other]

    cs.CL cs.AI cs.MM cs.SD eess.AS

    It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

    Authors: Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Ensiong Chng, Chao-Han Huck Yang

    Abstract: Recent studies have shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output. Specifically, an LLM is utilized to carry out a direct mapping from the N-best hypotheses list generated by an ASR system to the predicted output transcription. However, despite its effectiveness, GER introd…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted to ICLR 2024, 17 pages. This work will be open sourced under MIT license

  6. arXiv:2401.10447  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE cs.SD eess.AS

    Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition

    Authors: Yu Yu, Chao-Han Huck Yang, Tuan Dinh, Sungho Ryu, Jari Kolehmainen, Roger Ren, Denis Filimonov, Prashanth G. Shivakumar, Ankur Gandhe, Ariya Rastow, Jia Xu, Ivan Bulyko, Andreas Stolcke

    Abstract: The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasingly popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50% on the public Librispeech dat…

    Submitted 18 January, 2024; originally announced January 2024.

  7. arXiv:2401.10446  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, EnSiong Chng

    Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with the HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by e…

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted to ICLR 2024, Spotlight top 5%, 24 pages. This work will be open sourced at: https://github.com/YUCHEN005/RobustGER under MIT license

  8. arXiv:2312.15316  [pdf, other]

    cs.CL eess.AS

    Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

    Authors: Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko

    Abstract: Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore pro…

    Submitted 17 January, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024. Camera-ready version

  9. arXiv:2312.14378  [pdf, other]

    cs.LG cs.SD eess.AS

    Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification

    Authors: Anirudh S. Sundar, Chao-Han Huck Yang, David M. Chan, Shalini Ghosh, Venkatesh Ravichandran, Phani Sankar Nidadavolu

    Abstract: Training large foundation models using self-supervised objectives on unlabeled data, followed by fine-tuning on downstream tasks, has emerged as a standard procedure. Unfortunately, the efficacy of this approach is often constrained by both limited fine-tuning compute and scarcity in labeled downstream data. We introduce Multimodal Attention Merging (MAM), an attempt that facilitates direct knowle…

    Submitted 9 February, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: 5 pages, 1 figure, ICASSP 2024 Workshop on Self-supervision in Audio, Speech and Beyond

  10. arXiv:2310.13013  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Generative error correction for code-switching speech recognition using large language models

    Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Hexin Liu, Sabato Marco Siniscalchi, Eng Siong Chng

    Abstract: Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), CS-ASR is still a challenging task owing to the grammatical structure complexity of the phenomenon and the data scarcity of specific training corpus. In this work, we propose to leverage large language models (LLMs) and lis…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP2024

  11. arXiv:2310.06434  [pdf, other]

    cs.CL cs.AI cs.MM cs.SD eess.AS

    Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

    Authors: Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Rohit Kumar, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner

    Abstract: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction within the realm of n-best hypotheses. Unlike the exis…

    Submitted 16 October, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

    Comments: Accepted to EMNLP 2023 as main paper. 10 pages. Revised math notations. GitHub: https://github.com/Srijith-rkr/Whispering-LLaMA

  12. arXiv:2309.15701  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

    Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, Eng Siong Chng

    Abstract: Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuit…

    Submitted 16 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to NeurIPS 2023, 24 pages. Datasets and Benchmarks Track. Added the first Mandarin and code-switching (zh-cn and en-us) results from the LLM-based generative ASR error correction to Table 8 on Page 21

  13. arXiv:2309.15649  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting

    Authors: Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bulyko, Andreas Stolcke

    Abstract: We explore the ability of large language models (LLMs) to act as speech recognition post-processors that perform rescoring and error correction. Our first focus is on instruction prompting to let LLMs perform these tasks without fine-tuning, for which we evaluate different prompting schemes, both zero- and few-shot in-context learning, and a novel task activation prompting method that combines caus…

    Submitted 10 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE Automatic Speech Recognition and Understanding (ASRU) 2023. 8 pages. 2nd version revised from Sep 29th's version

    Journal ref: Proc. IEEE ASRU Workshop, Dec. 2023

  14. arXiv:2309.15223  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE cs.SD eess.AS

    Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition

    Authors: Yu Yu, Chao-Han Huck Yang, Jari Kolehmainen, Prashanth G. Shivakumar, Yile Gu, Sungho Ryu, Roger Ren, Qi Luo, Aditya Gourav, I-Fan Chen, Yi-Chieh Liu, Tuan Dinh, Ankur Gandhe, Denis Filimonov, Shalini Ghosh, Andreas Stolcke, Ariya Rastow, Ivan Bulyko

    Abstract: We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limit their practical use in rescoring. Here we p…

    Submitted 10 October, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE ASRU 2023. Internal Review Approved. Revised 2nd version with Andreas and Huck. The first version is in Sep 29th. 8 pages

    Journal ref: Proc. IEEE ASRU Workshop, Dec. 2023

  15. arXiv:2309.07081  [pdf, other]

    eess.AS cs.CL cs.SD

    Can Whisper perform speech-based in-context learning?

    Authors: Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang

    Abstract: This paper investigates the in-context learning abilities of the Whisper automatic speech recognition (ASR) models released by OpenAI. A novel speech-based in-context learning (SICL) approach is proposed for test-time adaptation, which can reduce the word error rates (WERs) with only a small number of labelled speech samples without gradient descent. Language-level adaptation experiments using Chi…

    Submitted 19 March, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP 2024

  16. arXiv:2306.01015  [pdf, other]

    cs.CL cs.NE cs.SD eess.AS

    How to Estimate Model Transferability of Pre-Trained Speech Models?

    Authors: Zih-Ching Chen, Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-yi Lee, Tara N. Sainath

    Abstract: In this work, we introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs) for fine-tuning target tasks. We leverage two representation theories, Bayesian likelihood estimation and optimal transport, to generate rank scores for the PSM candidates using the extracted representations. Our framework efficiently computes transferability…

    Submitted 5 February, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech. Code is available at: https://github.com/virginiakm1988/LogME-CTC. Fixed a typo

  17. arXiv:2306.00331  [pdf, other]

    eess.AS cs.AI cs.SD eess.SP eess.SY

    A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models

    Authors: Pin-Jui Ku, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose a multi-dimensional structured state space (S4) approach to speech enhancement. To better capture the spectral dependencies across the frequency axis, we focus on modifying the multi-dimensional S4 layer with whitening transformation to build new small-footprint models that also achieve good performance. We explore several S4-based deep architectures in time (T) and time-frequency (TF)…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech 2023. Code will be released at https://github.com/Kuray107/S4ND-U-Net_speech_enhancement

  18. arXiv:2305.16932  [pdf, other]

    cs.SD cs.CL eess.AS

    A Neural State-Space Model Approach to Efficient Speech Separation

    Authors: Chen Chen, Chao-Han Huck Yang, Kai Li, Yuchen Hu, Pin-Jui Ku, Eng Siong Chng

    Abstract: In this work, we introduce S4M, a new efficient speech separation framework based on neural state-space models (SSM). Motivated by linear time-invariant systems for sequence modeling, our SSM-based approach can efficiently model input signals into a format of linear ordinary differential equations (ODEs) for representation learning. To extend the SSM technique into speech separation tasks, we firs…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted by InterSpeech 2023

  19. arXiv:2305.11360  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Differentially Private Adapters for Parameter Efficient Acoustic Modeling

    Authors: Chun-Wei Ho, Chao-Han Huck Yang, Sabato Marco Siniscalchi

    Abstract: In this work, we devise a parameter-efficient solution to bring differential privacy (DP) guarantees into adaptation of a cross-lingual speech classifier. We investigate a new frozen pre-trained adaptation framework for DP-preserving speech modeling without full model fine-tuning. First, we introduce a noisy teacher-student ensemble into a conventional adaptation scheme leveraging a frozen pre-tra…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023. Code will be available at: https://github.com/Chun-wei-Ho/Private-Speech-Adapter. The authors would like to express their gratitude to Prof. Chin-Hui Lee from Georgia Tech for providing helpful insights and suggestions

  20. arXiv:2305.11320  [pdf, other]

    cs.SD cs.AI cs.NE eess.AS eess.SP

    Parameter-Efficient Learning for Text-to-Speech Accent Adaptation

    Authors: Li-Jen Yang, Chao-Han Huck Yang, Jen-Tzung Chien

    Abstract: This paper presents a parameter-efficient learning (PEL) approach to develop a low-resource accent adaptation for text-to-speech (TTS). A resource-efficient adaptation from a frozen pre-trained TTS model is developed by using only 1.2% to 0.8% of original trainable parameters to achieve competitive performance in voice synthesis. Motivated by a theoretical foundation of optimal transport (OT), this study…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  21. arXiv:2305.11244  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE eess.AS

    A Parameter-Efficient Learning Approach to Arabic Dialect Identification with Pre-Trained General-Purpose Speech Model

    Authors: Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner

    Abstract: In this work, we explore Parameter-Efficient-Learning (PEL) techniques to repurpose a General-Purpose-Speech (GSM) model for Arabic dialect identification (ADI). Specifically, we investigate different setups to incorporate trainable features into a multi-layer encoder-decoder GSM formulation under frozen pre-trained settings. Our architecture includes residual adapter and model reprogramming (inpu…

    Submitted 3 October, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023, 5 pages. Code is available at: https://github.com/Srijith-rkr/KAUST-Whisper-Adapter under MIT license

  22. arXiv:2301.07851  [pdf, other]

    cs.SD cs.AI cs.LG cs.NE eess.AS

    From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

    Authors: Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman

    Abstract: In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time…

    Submitted 18 January, 2023; originally announced January 2023.

    Comments: Submitted to ICASSP 2023. The project was initiated in May 2022 during a research internship at Google Research

  23. arXiv:2211.01317  [pdf, other]

    cs.SD cs.AI cs.LG cs.NE eess.AS

    Low-Resource Music Genre Classification with Cross-Modal Neural Model Reprogramming

    Authors: Yun-Ning Hung, Chao-Han Huck Yang, Pin-Yu Chen, Alexander Lerch

    Abstract: Transfer learning (TL) approaches have shown promising results when handling tasks with limited training data. However, considerable memory and computational resources are often required for fine-tuning pre-trained neural networks with target domain data. In this work, we introduce a novel method for leveraging pre-trained models for low-resource (music) classification based on the concept of Neur…

    Submitted 3 May, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to IEEE ICASSP 2023. The implementation is available at https://github.com/biboamy/music-repro

  24. arXiv:2211.01263  [pdf, other]

    cs.SD cs.LG eess.AS quant-ph

    A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition

    Authors: Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Tara N. Sainath, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose a quantum kernel learning (QKL) framework to address the inherent data sparsity issues often encountered in training large-scale acoustic models in low-resource scenarios. We project acoustic features based on classical-to-quantum feature encoding. Different from existing quantum convolution techniques, we utilize QKL with features in the quantum space to design kernel-based classifiers…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  25. arXiv:2211.01189  [pdf, other]

    eess.AS cs.AI cs.LG cs.NE cs.SD

    Inference and Denoise: Causal Inference-based Neural Speech Enhancement

    Authors: Tsun-An Hsieh, Chao-Han Huck Yang, Pin-Yu Chen, Sabato Marco Siniscalchi, Yu Tsao

    Abstract: This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention. Based on the potential outcome framework, the proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement module…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  26. arXiv:2211.00887  [pdf, other]

    quant-ph cs.LG cs.NE eess.SP

    Certified Robustness of Quantum Classifiers against Adversarial Examples through Quantum Noise

    Authors: Jhih-Cing Huang, Yu-Lin Tsai, Chao-Han Huck Yang, Cheng-Fang Su, Chia-Mu Yu, Pin-Yu Chen, Sy-Yen Kuo

    Abstract: Recently, quantum classifiers have been found to be vulnerable to adversarial attacks, in which quantum classifiers are deceived by imperceptible noises, leading to misclassification. In this paper, we propose the first theoretical study demonstrating that adding quantum random rotation noise can improve robustness in quantum classifiers against adversarial attacks. We link the definition of diffe…

    Submitted 28 April, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to IEEE ICASSP 2023

  27. arXiv:2210.06382  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    An Ensemble Teacher-Student Learning Approach with Poisson Sub-sampling to Differential Privacy Preserving Speech Recognition

    Authors: Chao-Han Huck Yang, Jun Qi, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose an ensemble learning framework with Poisson sub-sampling to effectively train a collection of teacher models to issue some differential privacy (DP) guarantee for training data. Through boosting under DP, a student model derived from the training data suffers little model degradation from the models trained with no privacy protection. Our proposed solution leverages two mechanisms,…

    Submitted 12 October, 2022; originally announced October 2022.

    Comments: Accepted to ISCA, ISCSLP 2022, Singapore. 5 Pages

  28. arXiv:2210.05614  [pdf, other]

    cs.SD cs.LG cs.NE eess.AS

    An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition

    Authors: Chao-Han Huck Yang, I-Fan Chen, Andreas Stolcke, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: Differential privacy (DP) is one data protection avenue to safeguard user information used for training deep models by imposing noisy distortion on privacy data. Such a noise perturbation often results in a severe performance degradation in automatic speech recognition (ASR) in order to meet a privacy budget $\varepsilon$. Private aggregation of teacher ensemble (PATE) utilizes ensemble probabilit…

    Submitted 13 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: 5 pages. Accepted to IEEE SLT 2022. A first version draft was finished in Aug 2021

  29. arXiv:2203.15529  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV

    Treatment Learning Causal Transformer for Noisy Image Classification

    Authors: Chao-Han Huck Yang, I-Te Danny Hung, Yi-Chieh Liu, Pin-Yu Chen

    Abstract: Current top-notch deep learning (DL) based vision models are primarily based on exploring and exploiting the inherent correlations between training data samples and their associated labels. However, a known practical challenge is their degraded performance against "noisy" data, induced by different circumstances such as spurious correlations, irrelevant contexts, domain shift, and adversarial atta…

    Submitted 30 October, 2023; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted to IEEE WACV 2023. The first version was finished in May 2018

  30. arXiv:2203.06031  [pdf, other]

    cs.LG cs.AI cs.SD eess.AS

    Exploiting Low-Rank Tensor-Train Deep Neural Networks Based on Riemannian Gradient Descent With Illustrations of Speech Processing

    Authors: Jun Qi, Chao-Han Huck Yang, Pin-Yu Chen, Javier Tejedor

    Abstract: This work focuses on designing low complexity hybrid tensor networks by considering trade-offs between the model complexity and practical performance. Firstly, we exploit a low-rank tensor-train deep neural network (TT-DNN) to build an end-to-end deep learning pipeline, namely LR-TT-DNN. Secondly, a hybrid model combining LR-TT-DNN with a convolutional neural network (CNN), which is denoted as CNN…

    Submitted 11 March, 2022; originally announced March 2022.

    Comments: 10 pages, 10 Figures

  31. arXiv:2203.04114  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification

    Authors: Qing Wang, Jun Du, Siyuan Zheng, Yunqing Li, Yajian Wang, Yuzhong Wu, Hu Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Yannan Wang, Chin-Hui Lee

    Abstract: In this paper, we propose two techniques, namely joint modeling and data augmentation, to improve system performances for audio-visual scene classification (AVSC). We employ pre-trained networks trained only on image data sets to extract video embedding; whereas for audio embedding models, we decide to train them from scratch. We explore different neural network architectures for joint modeling to…

    Submitted 31 August, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

    Comments: 5 pages, 1 figure

  32. arXiv:2203.03550  [pdf, other]

    cs.CL cs.AI cs.DC cs.NE eess.AS

    When BERT Meets Quantum Temporal Convolution Learning for Text Classification in Heterogeneous Computing

    Authors: Chao-Han Huck Yang, Jun Qi, Samuel Yen-Chi Chen, Yu Tsao, Pin-Yu Chen

    Abstract: The rapid development of quantum computing has demonstrated many unique characteristics of quantum advantages, such as richer feature representation and more secure protection of model parameters. This work proposes a vertical federated learning architecture based on variational quantum circuits to demonstrate the competitive performance of a quantum-enhanced pre-trained BERT model for text class…

    Submitted 17 February, 2022; originally announced March 2022.

    Comments: Accepted to ICASSP 2022

  33. arXiv:2202.08532  [pdf, other]

    eess.AS cs.AI cs.LG cs.NE cs.SD

    Mitigating Closed-model Adversarial Examples with Bayesian Neural Modeling for Enhanced End-to-End Speech Recognition

    Authors: Chao-Han Huck Yang, Zeeshan Ahmed, Yile Gu, Joseph Szurley, Roger Ren, Linda Liu, Andreas Stolcke, Ivan Bulyko

    Abstract: In this work, we aim to enhance the system robustness of end-to-end automatic speech recognition (ASR) against adversarially-noisy speech examples. We focus on a rigorous and empirical "closed-model adversarial robustness" setting (e.g., on-device or cloud applications). The adversarial noise is only generated by closed-model optimization (e.g., evolutionary and zeroth-order estimation) without ac…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022

  34. arXiv:2202.08509  [pdf, other]

    cs.SD cs.AI cs.CV cs.LG eess.AS

    A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning

    Authors: Hengshun Zhou, Jun Du, Chao-Han Huck Yang, Shifu Xiong, Chin-Hui Lee

    Abstract: Audio-only-based wake word spotting (WWS) is challenging under noisy conditions due to environmental interference in signal transmission. In this paper, we investigate designing a compact audio-visual WWS system by utilizing visual information to alleviate the degradation. Specifically, in order to use visual information, we first encode the detected lips to fixed-size vectors with MobileNet an…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022. H. Zhou et al

  35. arXiv:2111.14346  [pdf, other]

    cs.LG cs.AI cs.CE cs.NE eess.SY

    Pessimistic Model Selection for Offline Deep Reinforcement Learning

    Authors: Chao-Han Huck Yang, Zhengling Qi, Yifan Cui, Pin-Yu Chen

    Abstract: Deep Reinforcement Learning (DRL) has demonstrated great potential in solving sequential decision making problems in many applications. Despite its promising performance, practical gaps exist when deploying DRL in real-world scenarios. One main barrier is the over-fitting issue that leads to poor generalizability of the policy learned by DRL. In particular, for offline DRL with observational data…

    Submitted 29 November, 2021; originally announced November 2021.

    Comments: Preprint. A non-archival and preliminary venue was presented at NeurIPS 2021 Offline Reinforcement Learning Workshop

  36. arXiv:2110.08598  [pdf, other]

    eess.AS cs.AI cs.LG cs.NE cs.SD

    A Variational Bayesian Approach to Learning Latent Variables for Acoustic Knowledge Transfer

    Authors: Hu Hu, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Chin-Hui Lee

    Abstract: We propose a variational Bayesian (VB) approach to learning distributions of latent variables in deep neural network (DNN) models for cross-domain knowledge transfer, to address acoustic mismatches between training and testing conditions. Instead of carrying out point estimation in conventional maximum a posteriori estimation with a risk of having a curse of dimensionality in estimating a huge num…

    Submitted 20 February, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Comments: Accepted to ICASSP 2022. Code is available at https://github.com/MihawkHu/ASC_Knowledge_Transfer

  37. arXiv:2110.03894  [pdf, other]

    eess.AS cs.AI cs.LG cs.NE cs.SD

    Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition

    Authors: Hao Yen, Pin-Jui Ku, Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Yu Tsao

    Abstract: In this study, we propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR), and build an AR-SCR system. The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model (from the source domain). To solve the label mismatches between source and target domains, and further improve the stability of AR, w…

    Submitted 30 October, 2023; v1 submitted 8 October, 2021; originally announced October 2021.

    Comments: Accepted to Interspeech 2023. Code is available at: https://github.com/dodohow1011/SpeechAdvReprogram. Selected as Best Student Paper Candidate

  38. arXiv:2107.01461  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    A Lottery Ticket Hypothesis Framework for Low-Complexity Device-Robust Neural Acoustic Scene Classification

    Authors: Hao Yen, Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Qing Wang, Yuyang Wang, Xianjun Xia, Yuanjun Zhao, Yuzhong Wu, Yannan Wang, Jun Du, Chin-Hui Lee

    Abstract: We propose a novel neural model compression strategy combining data augmentation, knowledge transfer, pruning, and quantization for device-robust acoustic scene classification (ASC). Specifically, we tackle the ASC task in a low-resource environment leveraging a recently proposed advanced neural network pruning mechanism, namely Lottery Ticket Hypothesis (LTH), to find a sub-network neural model a…

    Submitted 1 May, 2022; v1 submitted 3 July, 2021; originally announced July 2021.

    Comments: 5 figures. DCASE 2021. The project started in November 2020. Revised version

  39. arXiv:2106.09296  [pdf, other]

    cs.LG cs.AI cs.NE cs.SD eess.AS

    Voice2Series: Reprogramming Acoustic Models for Time Series Classification

    Authors: Chao-Han Huck Yang, Yun-Yun Tsai, Pin-Yu Chen

    Abstract: Learning to classify time series with limited data is a practical yet challenging problem. Current methods are primarily based on hand-designed feature extraction rules or domain-specific data augmentation. Motivated by the advances in deep speech processing models and the fact that voice data are univariate temporal signals, in this paper, we propose Voice2Series (V2S), a novel end-to-end approac…

    Submitted 14 January, 2022; v1 submitted 17 June, 2021; originally announced June 2021.

    Comments: Updated version with a correction. The full draft was submitted in Jan 2021. The Voice2Series project initially was launched in Sep 2020. Accepted to ICML 2021, 16 Pages

    Report number: PMLR 139:11808-11819

    Journal ref: Proceedings of the 38th International Conference on Machine Learning 2021

  40. arXiv:2104.01271  [pdf, other]

    cs.SD cs.AI cs.LG cs.NE eess.AS

    PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification

    Authors: Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose using an adversarial autoencoder (AAE) to replace generative adversarial network (GAN) in the private aggregation of teacher ensembles (PATE), a solution for ensuring differential privacy in speech applications. The AAE architecture allows us to obtain good synthetic speech leveraging upon a discriminative training of latent vectors. Such synthetic speech is used to build a privacy-pres…

    Submitted 15 June, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

    Comments: Accepted to Interspeech 2021

    Journal ref: Proc. Interspeech 2021

  41. arXiv:2102.09677  [pdf, other]

    cs.LG cs.AI cs.NE cs.RO eess.SY

    Training a Resilient Q-Network against Observational Interference

    Authors: Chao-Han Huck Yang, I-Te Danny Hung, Yi Ouyang, Pin-Yu Chen

    Abstract: Deep reinforcement learning (DRL) has demonstrated impressive performance in various gaming simulators and real-world applications. In practice, however, a DRL agent may receive faulty observation by abrupt interferences such as black-out, frozen-screen, and adversarial perturbation. How to design a resilient DRL algorithm against these rare but mission-critical and safety-crucial scenarios is an…

    Submitted 25 January, 2022; v1 submitted 18 February, 2021; originally announced February 2021.

    Comments: Accepted to AAAI 2022. 9 pages

  42. arXiv:2011.11715  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE cs.SD eess.AS

    Multi-task Language Modeling for Improving Speech Recognition of Rare Words

    Authors: Chao-Han Huck Yang, Linda Liu, Ankur Gandhe, Yile Gu, Anirudh Raju, Denis Filimonov, Ivan Bulyko

    Abstract: End-to-end automatic speech recognition (ASR) systems are increasingly popular due to their relative architectural simplicity and competitive performance. However, even though the average accuracy of these systems may be high, the performance on rare content words often lags behind hybrid ASR systems. To address this problem, second-pass rescoring is often applied leveraging upon language modeling…

    Submitted 11 September, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

    Comments: Accepted to IEEE Automatic Speech Recognition and Understanding (ASRU) 2021

  43. arXiv:2011.01447  [pdf, other]

    cs.SD cs.AI cs.LG cs.NE eess.AS

    A Two-Stage Approach to Device-Robust Acoustic Scene Classification

    Authors: Hu Hu, Chao-Han Huck Yang, Xianjun Xia, Xue Bai, Xin Tang, Yajian Wang, Shutong Niu, Li Chai, Juanjuan Li, Hongning Zhu, Feng Bao, Yuanjun Zhao, Sabato Marco Siniscalchi, Yannan Wang, Jun Du, Chin-Hui Lee

    Abstract: To improve device robustness, a highly desirable key feature of a competitive data-driven acoustic scene classification (ASC) system, a novel two-stage system based on fully convolutional neural networks (CNNs) is proposed. Our two-stage system leverages on an ad-hoc score combination based on two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (i…

    Submitted 2 November, 2020; originally announced November 2020.

    Comments: Submitted to ICASSP 2021. Code available: https://github.com/MihawkHu/DCASE2020_task1

    Report number: 845--849

    Journal ref: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  44. arXiv:2010.13309  [pdf, other]

    cs.SD cs.LG cs.NE eess.AS quant-ph

    Decentralizing Feature Extraction with Quantum Convolutional Neural Network for Automatic Speech Recognition

    Authors: Chao-Han Huck Yang, Jun Qi, Samuel Yen-Chi Chen, Pin-Yu Chen, Sabato Marco Siniscalchi, Xiaoli Ma, Chin-Hui Lee

    Abstract: We propose a novel decentralized feature extraction approach in federated learning to address privacy-preservation issues for speech recognition. It is built upon a quantum convolutional neural network (QCNN) composed of a quantum circuit encoder for feature extraction, and a recurrent neural network (RNN) based end-to-end acoustic model (AM). To enhance model parameter protection in a decentraliz…

    Submitted 12 February, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

    Comments: Accepted to IEEE ICASSP 2021. Code is available: https://github.com/huckiyang/QuantumSpeech-QCNN

    Journal ref: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  45. arXiv:2007.13024  [pdf, other]

    eess.AS cs.CL cs.LG cs.NE cs.SD

    Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement

    Authors: Jun Qi, Hu Hu, Yannan Wang, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: This paper investigates different trade-offs between the number of model parameters and enhanced speech qualities by employing several deep tensor-to-vector regression models for speech enhancement. We find that a hybrid architecture, namely CNN-TT, is capable of maintaining a good quality performance with a reduced model parameter size. CNN-TT is composed of several convolutional layers at the bo…

    Submitted 2 August, 2020; v1 submitted 25 July, 2020; originally announced July 2020.

    Comments: Accepted to InterSpeech 2020

  46. arXiv:2007.09163  [pdf, other]

    cs.CV cs.LG cs.NE eess.IV

    Wavelet Channel Attention Module with a Fusion Network for Single Image Deraining

    Authors: Hao-Hsiang Yang, Chao-Han Huck Yang, Yu-Chiang Frank Wang

    Abstract: Single image deraining is a crucial problem because rain severely degenerates the visibility of images and affects the performance of computer vision tasks like outdoor surveillance systems and intelligent vehicles. In this paper, we propose the new convolutional neural network (CNN) called the wavelet channel attention module with a fusion network. Wavelet transform and the inverse wavelet transf…

    Submitted 17 July, 2020; originally announced July 2020.

    Comments: Accepted to IEEE ICIP 2020

    Journal ref: 2020 IEEE International Conference on Image Processing (ICIP)

  47. arXiv:2007.08389  [pdf, other]

    eess.AS cs.LG cs.SD

    Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

    Authors: Hu Hu, Chao-Han Huck Yang, Xianjun Xia, Xue Bai, Xin Tang, Yajian Wang, Shutong Niu, Li Chai, Juanjuan Li, Hongning Zhu, Feng Bao, Yuanjun Zhao, Sabato Marco Siniscalchi, Yannan Wang, Jun Du, Chin-Hui Lee

    Abstract: In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes, and (ii) Task 1b concerns with cla…

    Submitted 26 August, 2020; v1 submitted 16 July, 2020; originally announced July 2020.

    Comments: Revised Technical Report. Proposed systems attain 2nds in both Task-1a and Task-1b in the official DCASE challenge 2020

  48. arXiv:2003.13917  [pdf, other]

    eess.AS cs.CL cs.CR cs.LG cs.SD

    Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement

    Authors: Chao-Han Huck Yang, Jun Qi, Pin-Yu Chen, Xiaoli Ma, Chin-Hui Lee

    Abstract: Recent studies have highlighted adversarial examples as ubiquitous threats to the deep neural network (DNN) based speech recognition systems. In this work, we present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals. Specifically, we evaluate the model performance by interpretable speech recognition metrics and discuss the model performance by the augmented advers…

    Submitted 31 December, 2021; v1 submitted 30 March, 2020; originally announced March 2020.

    Comments: The authors have revised some annotations in Table 4 to improve the clarity. The authors thank reading feedbacks from Jonathan Le Roux. The first draft was finished in August 2019. Accepted to IEEE ICASSP 2020

    Journal ref: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  49. arXiv:2003.13912  [pdf, other]

    cs.CV cs.GR cs.LG eess.IV

    Y-net: Multi-scale feature aggregation network with wavelet structure similarity loss function for single image dehazing

    Authors: Hao-Hsiang Yang, Chao-Han Huck Yang, Yi-Chang James Tsai

    Abstract: Single image dehazing is the ill-posed two-dimensional signal reconstruction problem. Recently, deep convolutional neural networks (CNN) have been successfully used in many computer vision problems. In this paper, we propose a Y-net that is named for its structure. This network reconstructs clear images by aggregating multi-scale features maps. Additionally, we propose a Wavelet Structure SIMilari…

    Submitted 30 March, 2020; originally announced March 2020.

    Comments: Accepted to IEEE ICASSP 2020

  50. Enhanced Adversarial Strategically-Timed Attacks against Deep Reinforcement Learning

    Authors: Chao-Han Huck Yang, Jun Qi, Pin-Yu Chen, Yi Ouyang, I-Te Danny Hung, Chin-Hui Lee, Xiaoli Ma

    Abstract: Recent deep neural networks based techniques, especially those equipped with the ability of self-adaptation in the system level such as deep reinforcement learning (DRL), are shown to possess many advantages of optimizing robot learning systems (e.g., autonomous navigation and continuous robot arm control.) However, the learning-based systems and the associated models may be threatened by the risk…

    Submitted 20 February, 2020; originally announced February 2020.

    Comments: Accepted to IEEE ICASSP 2020

    Journal ref: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)