Showing 1–50 of 156 results for author: Tsao, Y

  1. arXiv:2406.12699  [pdf, other]

    cs.SD eess.AS eess.SP

    Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition

    Authors: Kuan-Chen Wang, You-Jin Li, Wei-Lun Chen, Yu-Wen Chen, Yi-Ching Wang, Ping-Cheng Yeh, Chao Zhang, Yu Tsao

    Abstract: Noise robustness is critical when applying automatic speech recognition (ASR) in real-world scenarios. One solution involves the use of speech enhancement (SE) models as the front end of ASR. However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are independently trained. Therefore, this study intro… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

  2. arXiv:2406.08445  [pdf, other]

    eess.AS cs.LG cs.SD

    SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models

    Authors: Chun Yin, Tai-Shih Chi, Yu Tsao, Hsin-Min Wang

    Abstract: Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH 2024

  3. Abnormal Respiratory Sound Identification Using Audio-Spectrogram Vision Transformer

    Authors: Whenty Ariyanti, Kai-Chun Liu, Kuan-Yu Chen, Yu Tsao

    Abstract: Respiratory disease, the third leading cause of deaths globally, is considered a high-priority ailment requiring significant research on identification and treatment. Stethoscope-recorded lung sounds and artificial intelligence-powered devices have been used to identify lung disorders and aid specialists in making accurate diagnoses. In this study, audio-spectrogram vision transformer (AS-ViT), a… ▽ More

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: Published in 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)

    Journal ref: 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (2023) 1-4

  4. arXiv:2405.06573  [pdf, other]

    cs.SD cs.AI eess.AS

    An Investigation of Incorporating Mamba for Speech Enhancement

    Authors: Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao

    Abstract: This work aims to study a scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. We exploit a Mamba-based regression model to characterize speech signals and build an SE system upon Mamba, termed SEMamba. We explore the properties of Mamba by integrating it as the core model in both basic and advanced SE systems, along with utilizing signal-level distances as well as metric… ▽ More

    Submitted 10 May, 2024; originally announced May 2024.

  5. arXiv:2405.04097  [pdf, other]

    cs.CV cs.AI cs.CY cs.LG cs.MM

    Unmasking Illusions: Understanding Human Perception of Audiovisual Deepfakes

    Authors: Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, Yu Tsao, Hsin-Min Wang

    Abstract: The emergence of contemporary deepfakes has attracted significant attention in machine learning research, as artificial intelligence (AI) generated synthetic media increases the incidence of misinterpretation and is difficult to distinguish from genuine content. Currently, machine learning techniques have been extensively studied for automatically detecting deepfakes. However, human perception has… ▽ More

    Submitted 7 May, 2024; originally announced May 2024.

  6. arXiv:2404.14397  [pdf, other]

    cs.CL cs.CY cs.LG

    RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?

    Authors: Adrian de Wynter, Ishaan Watts, Nektar Ege Altıntoprak, Tua Wongsangaroonsri, Minghui Zhang, Noura Farra, Lena Baur, Samantha Claudet, Pavel Gajdusek, Can Gören, Qilong Gu, Anna Kaminska, Tomasz Kaminski, Ruby Kuo, Akiko Kyuba, Jongho Lee, Kartik Mathur, Petter Merok, Ivana Milovanović, Nani Paananen, Vesa-Matti Paananen, Anna Pavlenko, Bruno Pereira Vidal, Luciano Strika, Yueh Tsao, et al. (8 additional authors not shown)

    Abstract: Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety still remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale: can we expand multilingual safety evaluations of these models with the same velocity at which they are deployed? To this end we introduce RTP-LX, a human-transc… ▽ More

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: Work in progress

  7. arXiv:2402.16757  [pdf, other]

    cs.SD eess.AS

    Towards Environmental Preference Based Speech Enhancement For Individualised Multi-Modal Hearing Aids

    Authors: Jasper Kirton-Wingate, Shafique Ahmed, Adeel Hussain, Mandar Gogate, Kia Dashtipour, Jen-Cheng Hou, Tassadaq Hussain, Yu Tsao, Amir Hussain

    Abstract: Since the advent of Deep Learning (DL), Speech Enhancement (SE) models have performed well under a variety of noise conditions. However, such systems may still introduce sonic artefacts, sound unnatural, and restrict the ability for a user to hear ambient sound which may be of importance. Hearing Aid (HA) users may wish to customise their SE systems to suit their personal preferences and day-to-da… ▽ More

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: This has been submitted to the Trends in Hearing journal

  8. arXiv:2402.16394  [pdf, other]

    eess.AS cs.SD

    Audio-Visual Speech Enhancement in Noisy Environments via Emotion-Based Contextual Cues

    Authors: Tassadaq Hussain, Kia Dashtipour, Yu Tsao, Amir Hussain

    Abstract: In real-world environments, background noise significantly degrades the intelligibility and clarity of human speech. Audio-visual speech enhancement (AVSE) attempts to restore speech quality, but existing methods often fall short, particularly in dynamic noise conditions. This study investigates the inclusion of emotion as a novel contextual cue within AVSE, hypothesizing that incorporating emotio… ▽ More

    Submitted 26 February, 2024; originally announced February 2024.

  9. arXiv:2402.16321  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech

    Authors: Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao, Yu-Chiang Frank Wang

    Abstract: Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, which is time-consuming and expensive for label collection. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized-variatio… ▽ More

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: Published as a conference paper at ICLR 2024

  10. arXiv:2402.05482  [pdf, other]

    eess.SP cs.LG

    A Non-Intrusive Neural Quality Assessment Model for Surface Electromyography Signals

    Authors: Cho-Yuan Lee, Kuan-Chen Wang, Kai-Chun Liu, Yu-Te Wang, Xugang Lu, Ping-Cheng Yeh, Yu Tsao

    Abstract: In practical scenarios involving the measurement of surface electromyography (sEMG) in muscles, particularly those areas near the heart, one of the primary sources of contamination is the presence of electrocardiogram (ECG) signals. To assess the quality of real-world sEMG data more effectively, this study proposes QASE-net, a new non-intrusive model that predicts the SNR of sEMG signals. QASE-net… ▽ More

    Submitted 13 June, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

    Comments: 5 pages, 4 figures

  11. SDEMG: Score-based Diffusion Model for Surface Electromyographic Signal Denoising

    Authors: Yu-Tung Liu, Kuan-Chen Wang, Kai-Chun Liu, Sheng-Yu Peng, Yu Tsao

    Abstract: Surface electromyography (sEMG) recordings can be influenced by electrocardiogram (ECG) signals when the muscle being monitored is close to the heart. Several existing methods use signal-processing-based approaches, such as high-pass filtering and template subtraction, while some derive mapping functions to restore clean sEMG signals from noisy sEMG (sEMG with ECG interference). Recently, the score-b… ▽ More

    Submitted 23 February, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

    Comments: This paper is accepted by ICASSP 2024

  12. arXiv:2401.01145  [pdf, other]

    eess.AS cs.LG cs.SD

    HAAQI-Net: A Non-intrusive Neural Music Audio Quality Assessment Model for Hearing Aids

    Authors: Dyah A. M. G. Wisnu, Stefano Rini, Ryandhimas E. Zezario, Hsin-Min Wang, Yu Tsao

    Abstract: This paper introduces HAAQI-Net, a non-intrusive deep learning model for music audio quality assessment tailored for hearing aid users. Unlike traditional methods like the Hearing Aid Audio Quality Index (HAAQI), which rely on intrusive comparisons to a reference signal, HAAQI-Net offers a more accessible and efficient alternative. Using a bidirectional Long Short-Term Memory (BLSTM) architecture… ▽ More

    Submitted 5 June, 2024; v1 submitted 2 January, 2024; originally announced January 2024.

  13. arXiv:2312.08622  [pdf, other]

    eess.AS cs.LG cs.SD

    Scalable Ensemble-based Detection Method against Adversarial Attacks for speaker verification

    Authors: Haibin Wu, Heng-Cheng Kuo, Yu Tsao, Hung-yi Lee

    Abstract: Automatic speaker verification (ASV) is highly susceptible to adversarial attacks. Purification modules are usually adopted as a pre-processing to mitigate adversarial noise. However, they are commonly implemented across diverse experimental settings, rendering direct comparisons challenging. This paper comprehensively compares mainstream purification techniques in a unified framework. We find the… ▽ More

    Submitted 13 December, 2023; originally announced December 2023.

    Comments: Submitted to 2024 ICASSP

  14. arXiv:2311.16604  [pdf, other]

    eess.AS cs.LG

    LC4SV: A Denoising Framework Learning to Compensate for Unseen Speaker Verification Models

    Authors: Chi-Chang Lee, Hong-Wei Chen, Chu-Song Chen, Hsin-Min Wang, Tsung-Te Liu, Yu Tsao

    Abstract: The performance of speaker verification (SV) models may drop dramatically in noisy environments. A speech enhancement (SE) module can be used as a front-end strategy. However, existing SE methods may fail to bring performance improvements to downstream SV systems due to artifacts in the predicted signals of SE models. To compensate for artifacts, we propose a generic denoising framework named LC4S… ▽ More

    Submitted 28 November, 2023; originally announced November 2023.

  15. arXiv:2311.16595  [pdf, other]

    cs.SD cs.LG eess.AS

    D4AM: A General Denoising Framework for Downstream Acoustic Models

    Authors: Chi-Chang Lee, Yu Tsao, Hsin-Min Wang, Chu-Song Chen

    Abstract: The performance of acoustic models degrades notably in noisy environments. Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems. However, existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems. In this study, we propose a general denoisi… ▽ More

    Submitted 28 November, 2023; originally announced November 2023.

  16. arXiv:2311.15582  [pdf, other]

    cs.SD cs.LG eess.AS

    Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice

    Authors: Yi-Heng Lin, Wen-Hsuan Tseng, Li-Chin Chen, Ching-Ting Tan, Yu Tsao

    Abstract: The Consensus Auditory-Perceptual Evaluation of Voice is a widely employed tool in clinical voice quality assessment that is significant for streaming communication among clinical professionals and benchmarking for the determination of further treatment. Currently, because the assessment relies on experienced clinicians, it tends to be inconsistent, and thus, difficult to standardize. To address t… ▽ More

    Submitted 27 November, 2023; originally announced November 2023.

    Comments: Published in IEEE 42nd International Conference on Consumer Electronics (ICCE 2024)

  17. arXiv:2311.08878  [pdf, other]

    eess.AS cs.SD

    Multi-objective Non-intrusive Hearing-aid Speech Assessment Model

    Authors: Hsin-Tien Chiang, Szu-Wei Fu, Hsin-Min Wang, Yu Tsao, John H. L. Hansen

    Abstract: Without the need for a clean reference, non-intrusive speech assessment methods have caught great attention for objective evaluations. While deep learning models have been used to develop non-intrusive speech assessment methods with promising results, there is limited research on hearing-impaired subjects. This study proposes a multi-objective non-intrusive hearing-aid speech assessment model, cal… ▽ More

    Submitted 15 November, 2023; originally announced November 2023.

  18. arXiv:2311.02733  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM cs.SD eess.AS

    AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection

    Authors: Sahibzada Adil Shahzad, Ammarah Hashmi, Yan-Tsung Peng, Yu Tsao, Hsin-Min Wang

    Abstract: Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. To avoid the spread of false propaganda and fake news, timely detection is crucial. The damage to either modality (i.e., visual or audio) can only be discovered through multi-modal models that can exploit both pieces of information simultaneou… ▽ More

    Submitted 5 November, 2023; originally announced November 2023.

  19. arXiv:2310.13471  [pdf, ps, other]

    eess.AS cs.SD

    Neural domain alignment for spoken language recognition based on optimal transport

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Domain shift poses a significant challenge in cross-domain spoken language recognition (SLR) by reducing its effectiveness. Unsupervised domain adaptation (UDA) algorithms have been explored to address domain shifts in SLR without relying on class labels in the target domain. One successful UDA approach focuses on learning domain-invariant representations to align feature distributions between dom… ▽ More

    Submitted 20 October, 2023; originally announced October 2023.

  20. arXiv:2310.13103  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM cs.SD eess.AS

    AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection

    Authors: Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, Yu Tsao, Hsin-Min Wang

    Abstract: Forged content shared widely on social media platforms is a major social problem that requires increased regulation and poses new challenges to the research community. The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous work on detecting AI-generated fake videos only utilizes visual modality or audio modality. W… ▽ More

    Submitted 19 October, 2023; originally announced October 2023.

  21. arXiv:2309.16093  [pdf, ps, other]

    eess.AS cs.SD

    Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) still remains a challenging task. In this study, we propose a cross-modality knowledge transfer (CMKT) learning framework in a temporal connectionist temporal classification (CTC) base… ▽ More

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  22. arXiv:2309.13650  [pdf, ps, other]

    eess.AS cs.SD

    Cross-modal Alignment with Optimal Transport for CTC-based ASR

    Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

    Abstract: Temporal connectionist temporal classification (CTC)-based automatic speech recognition (ASR) is one of the most successful end to end (E2E) ASR frameworks. However, due to the token independence assumption in decoding, an external language model (LM) is required which destroys its fast parallel decoding property. Several studies have been proposed to transfer linguistic knowledge from a pretraine… ▽ More

    Submitted 24 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE ASRU 2023

  23. arXiv:2309.12766  [pdf, other]

    eess.AS cs.SD

    A Study on Incorporating Whisper for Robust Speech Assessment

    Authors: Ryandhimas E. Zezario, Yu-Wen Chen, Szu-Wei Fu, Yu Tsao, Hsin-Min Wang, Chiou-Shann Fuh

    Abstract: This research introduces an enhanced version of the multi-objective speech assessment model--MOSA-Net+, by leveraging the acoustic features from Whisper, a large-scaled weakly supervised model. We first investigate the effectiveness of Whisper in deploying a more robust speech assessment model. After that, we explore combining representations from Whisper and SSL models. The experimental results r… ▽ More

    Submitted 29 April, 2024; v1 submitted 22 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE ICME 2024

  24. arXiv:2309.11059  [pdf, other]

    eess.AS cs.SD

    Deep Complex U-Net with Conformer for Audio-Visual Speech Enhancement

    Authors: Shafique Ahmed, Chia-Wei Chen, Wenze Ren, Chin-Jou Li, Ernie Chu, Jun-Cheng Chen, Amir Hussain, Hsin-Min Wang, Yu Tsao, Jen-Cheng Hou

    Abstract: Recent studies have increasingly acknowledged the advantages of incorporating visual data into speech enhancement (SE) systems. In this paper, we introduce a novel audio-visual SE approach, termed DCUC-Net (deep complex U-Net with conformer network). The proposed DCUC-Net leverages complex domain features and a stack of conformer blocks. The encoder and decoder of DCUC-Net are designed using a com… ▽ More

    Submitted 8 October, 2023; v1 submitted 20 September, 2023; originally announced September 2023.

  25. arXiv:2309.10787  [pdf, other]

    eess.AS cs.CV cs.MM cs.SD

    AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

    Authors: Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee

    Abstract: Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual a… ▽ More

    Submitted 19 March, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024; Evaluation Code: https://github.com/roger-tseng/av-superb Submission Platform: https://av.superbbenchmark.org

  26. arXiv:2309.09548  [pdf, other]

    eess.AS cs.LG cs.SD

    Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata

    Authors: Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

    Abstract: Automated speech intelligibility assessment is pivotal for hearing aid (HA) development. In this paper, we present three novel methods to improve intelligibility prediction accuracy and introduce MBI-Net+, an enhanced version of MBI-Net, the top-performing system in the 1st Clarity Prediction Challenge. MBI-Net+ leverages Whisper's embeddings to create cross-domain acoustic features and includes m… ▽ More

    Submitted 13 June, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: Accepted to Interspeech 2024

  27. arXiv:2309.01164  [pdf, other]

    eess.AS cs.LG cs.SD

    Noise robust speech emotion recognition with signal-to-noise ratio adapting speech enhancement

    Authors: Yu-Wen Chen, Julia Hirschberg, Yu Tsao

    Abstract: Speech emotion recognition (SER) often experiences reduced performance due to background noise. In addition, making a prediction on signals with only background noise could undermine user trust in the system. In this study, we propose a Noise Robust Speech Emotion Recognition system, NRSER. NRSER employs speech enhancement (SE) to effectively reduce the noise in input signals. Then, the signal-to-… ▽ More

    Submitted 3 September, 2023; originally announced September 2023.

  28. arXiv:2308.09262  [pdf, other]

    eess.AS cs.LG cs.SD

    Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model

    Authors: Ryandhimas E. Zezario, Bo-Ren Brian Bai, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

    Abstract: This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net. MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning. The 3QUEST metrics, namely Speech-MOS (S-MOS), Noise-MOS (N-MOS), and General-MOS (G-MOS), are the assessment targets. The pretrained MOSA-Net model is u… ▽ More

    Submitted 13 March, 2024; v1 submitted 17 August, 2023; originally announced August 2023.

    Comments: Accepted to IEEE ICASSP 2024

  29. arXiv:2306.10756  [pdf, other]

    cs.CV cs.AI

    A HRNet-based Rehabilitation Monitoring System

    Authors: Yi-Ching Hung, Yu-Qing Jiang, Fong-Syuan Liou, Yu-Hsuan Tsao, Zi-Cing Chiang, Min-Te Sun

    Abstract: The rehabilitation treatment helps to heal minor sports and occupational injuries. In a traditional rehabilitation process, a therapist will assign certain actions to a patient to perform in between hospital visits, and it will rely on the patient to remember actions correctly and the schedule to perform them. Unfortunately, many patients forget to perform actions or fail to recall actions in deta… ▽ More

    Submitted 14 July, 2023; v1 submitted 19 June, 2023; originally announced June 2023.

  30. arXiv:2306.06865  [pdf, other]

    cs.LG cs.AI eess.SP

    Deep denoising autoencoder-based non-invasive blood flow detection for arteriovenous fistula

    Authors: Li-Chin Chen, Yi-Heng Lin, Li-Ning Peng, Feng-Ming Wang, Yu-Hsin Chen, Po-Hsun Huang, Shang-Feng Yang, Yu Tsao

    Abstract: Clinical guidelines underscore the importance of regularly monitoring and surveilling arteriovenous fistula (AVF) access in hemodialysis patients to promptly detect any dysfunction. Although phono-angiography/sound analysis overcomes the limitations of standardized AVF stenosis diagnosis tool, prior studies have depended on conventional feature extraction methods, restricting their applicability i… ▽ More

    Submitted 12 June, 2023; originally announced June 2023.

  31. arXiv:2306.06653  [pdf, other]

    cs.SD eess.AS

    Mandarin Electrolaryngeal Speech Voice Conversion using Cross-domain Features

    Authors: Hsin-Hao Chen, Yung-Lun Chien, Ming-Chi Yen, Shu-Wei Tsai, Yu Tsao, Tai-shih Chi, Hsin-Min Wang

    Abstract: Patients who have had their entire larynx removed, including the vocal folds, owing to throat cancer may experience difficulties in speaking. In such cases, electrolarynx devices are often prescribed to produce speech, which is commonly referred to as electrolaryngeal speech (EL speech). However, the quality and intelligibility of EL speech are poor. To address this problem, EL voice conversion (E… ▽ More

    Submitted 11 June, 2023; originally announced June 2023.

    Comments: Accepted to INTERSPEECH 2023

  32. arXiv:2306.06652  [pdf, other]

    cs.SD eess.AS

    Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion

    Authors: Yung-Lun Chien, Hsin-Hao Chen, Ming-Chi Yen, Shu-Wei Tsai, Hsin-Min Wang, Yu Tsao, Tai-Shih Chi

    Abstract: Electrolarynx is a commonly used assistive device to help patients with removed vocal cords regain their ability to speak. Although the electrolarynx can generate excitation signals like the vocal cords, the naturalness and intelligibility of electrolaryngeal (EL) speech are very different from those of natural (NL) speech. Many deep-learning-based models have been applied to electrolaryngeal spee… ▽ More

    Submitted 11 June, 2023; originally announced June 2023.

    Comments: Accepted to INTERSPEECH 2023

  33. arXiv:2305.16753  [pdf, other]

    eess.AS cs.AI eess.SP

    ElectrodeNet -- A Deep Learning Based Sound Coding Strategy for Cochlear Implants

    Authors: Enoch Hsin-Ho Huang, Rong Chao, Yu Tsao, Chao-Min Wu

    Abstract: ElectrodeNet, a deep learning based sound coding strategy for the cochlear implant (CI), is proposed to emulate the advanced combination encoder (ACE) strategy by replacing the conventional envelope detection using various artificial neural networks. The extended ElectrodeNet-CS strategy further incorporates the channel selection (CS). Network models of deep neural network (DNN), convolutional neu… ▽ More

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: 12 pages and 7 figures. Preprint version; IEEE Transactions on Cognitive and Developmental Systems (accepted)

  34. arXiv:2304.06335  [pdf]

    cs.LG eess.SP

    Deep Learning-based Fall Detection Algorithm Using Ensemble Model of Coarse-fine CNN and GRU Networks

    Authors: Chien-Pin Liu, Ju-Hsuan Li, En-Ping Chu, Chia-Yeh Hsieh, Kai-Chun Liu, Chia-Tai Chan, Yu Tsao

    Abstract: Falls are a public health issue for the elderly all over the world, since fall-induced injuries are associated with a large amount of healthcare cost. Falls can cause serious injuries, even leading to death if an elderly person suffers a "long-lie". Hence, a reliable fall detection (FD) system is required to provide an emergency alarm for first aid. Due to the advances in wearable device technology… ▽ More

    Submitted 13 April, 2023; originally announced April 2023.

  35. arXiv:2303.09085  [pdf, other]

    cs.LG

    Preoperative Prognosis Assessment of Lumbar Spinal Surgery for Low Back Pain and Sciatica Patients based on Multimodalities and Multimodal Learning

    Authors: Li-Chin Chen, Jung-Nien Lai, Hung-En Lin, Hsien-Te Chen, Kuo-Hsuan Hung, Yu Tsao

    Abstract: Low back pain (LBP) and sciatica may require surgical therapy when they are symptomatic of severe pain. However, there are no effective measures to evaluate the surgical outcomes in advance. This work combined elements of Eastern medicine and machine learning, and developed a preoperative assessment tool to predict the prognosis of lumbar spinal surgery in LBP and sciatica patients. Standard operat… ▽ More

    Submitted 16 March, 2023; originally announced March 2023.

  36. Self-supervised learning-based general laboratory progress pretrained model for cardiovascular event detection

    Authors: Li-Chin Chen, Kuo-Hsuan Hung, Yi-Ju Tseng, Hsin-Yao Wang, Tse-Min Lu, Wei-Chieh Huang, Yu Tsao

    Abstract: The inherent nature of patient data poses several challenges. Prevalent cases amass substantial longitudinal data owing to their patient volume and consistent follow-ups; however, longitudinal laboratory data are renowned for their irregularity, temporality, absenteeism, and sparsity. In contrast, recruitment for rare or specific cases is often constrained due to their limited patient size and epi… ▽ More

    Submitted 7 September, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

    Comments: published in IEEE Journal of Translational Engineering in Health & Medicine

    Journal ref: IEEE Journal of Translational Engineering in Health and Medicine, vol.12, p.43-56, 2023

  37. arXiv:2303.03634  [pdf]

    eess.SP cs.LG

    PreFallKD: Pre-Impact Fall Detection via CNN-ViT Knowledge Distillation

    Authors: Tin-Han Chi, Kai-Chun Liu, Chia-Yeh Hsieh, Yu Tsao, Chia-Tai Chan

    Abstract: Fall accidents are critical issues in an aging and aged society. Recently, many researchers developed pre-impact fall detection systems using deep learning to support wearable-based fall protection systems for preventing severe injuries. However, most works only employed simple neural network models instead of complex models considering the usability in resource-constrained mobile devices and stri… ▽ More

    Submitted 28 March, 2023; v1 submitted 6 March, 2023; originally announced March 2023.

  38. arXiv:2302.01798  [pdf, other]

    cs.LG

    Interpretations of Domain Adaptations via Layer Variational Analysis

    Authors: Huan-Hsin Tseng, Hsin-Yi Lin, Kuo-Hsuan Hung, Yu Tsao

    Abstract: Transfer learning is known to perform efficiently in many applications empirically, yet limited literature reports the mechanism behind the scene. This study establishes both formal derivations and heuristic analysis to formulate the theory of transfer learning in deep learning. Our framework utilizing layer variational analysis proves that the success of transfer learning can be guaranteed with c… ▽ More

    Submitted 9 May, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

    Comments: Published at ICLR 2023

  39. arXiv:2301.04120  [pdf, other]

    cs.NE cs.AI cs.CL cs.LG eess.AS

    BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm

    Authors: Yu-Wen Chen, Hsin-Min Wang, Yu Tsao

    Abstract: The performance of speech-processing models is heavily influenced by the speech corpus that is used for training and evaluation. In this study, we propose BAlanced Script PROducer (BASPRO) system, which can automatically construct a phonetically balanced and rich set of Chinese sentences for collecting Mandarin Chinese speech data. First, we used pretrained natural language processing systems to e… ▽ More

    Submitted 10 December, 2022; originally announced January 2023.

    Comments: accepted by APSIPA Transactions on Signal and Information Processing

  40. arXiv:2211.06508  [pdf, other]

    cs.SD cs.LG eess.AS

    On the robustness of non-intrusive speech quality model by adversarial examples

    Authors: Hsin-Yi Lin, Huan-Hsin Tseng, Yu Tsao

    Abstract: It has been shown recently that deep learning based models are effective on speech quality prediction and could outperform traditional metrics in various perspectives. Although network models have potential to be a surrogate for complex human hearing perception, they may contain instabilities in predictions. This work shows that deep speech quality predictors can be vulnerable to adversarial pertu… ▽ More

    Submitted 11 November, 2022; originally announced November 2022.

  41. arXiv:2211.01189  [pdf, other]

    eess.AS cs.AI cs.LG cs.NE cs.SD

    Inference and Denoise: Causal Inference-based Neural Speech Enhancement

    Authors: Tsun-An Hsieh, Chao-Han Huck Yang, Pin-Yu Chen, Sabato Marco Siniscalchi, Yu Tsao

    Abstract: This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention. Based on the potential outcome framework, the proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement module… ▽ More

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  42. arXiv:2211.00586  [pdf, other]

    cs.CL cs.SD eess.AS

    T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5

    Authors: Chan-Jan Hsu, Ho-Lam Chung, Hung-yi Lee, Yu Tsao

    Abstract: In spoken language understanding (SLU), a natural solution is concatenating pre-trained speech models (e.g., HuBERT) and pre-trained language models (PLM, e.g., T5). Most previous works use pre-trained language models with subword-based tokenization. However, the granularity of input units affects the alignment of speech model outputs and language model inputs, and PLM with character-based tokenizatio…

    Submitted 1 November, 2022; originally announced November 2022.

  43. arXiv:2210.17456  [pdf, other]

    eess.AS cs.SD

    Audio-Visual Speech Enhancement and Separation by Utilizing Multi-Modal Self-Supervised Embeddings

    Authors: I-Chun Chern, Kuo-Hsuan Hung, Yi-Ting Chen, Tassadaq Hussain, Mandar Gogate, Amir Hussain, Yu Tsao, Jen-Cheng Hou

    Abstract: AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained via utilizing multi-modal self-supervised embeddings. Nevertheless, it is unclear if such representations can be generalized to solve real-world multi-moda…

    Submitted 31 May, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: ICASSP AMHAT 2023

  44. arXiv:2210.15370  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG cs.MM eess.AS

    CasNet: Investigating Channel Robustness for Speech Separation

    Authors: Fan-Lin Wang, Yao-Fei Cheng, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang

    Abstract: Recording channel mismatch between training and testing conditions has been shown to be a serious problem for speech separation. This situation greatly reduces the separation performance, and cannot meet the requirement of daily use. In this study, inheriting the use of our previously constructed TAT-2mix corpus, we address the channel mismatch problem by proposing a channel-aware audio separation…

    Submitted 27 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  45. arXiv:2210.15368  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG cs.MM eess.AS

    A Training and Inference Strategy Using Noisy and Enhanced Speech as Target for Speech Enhancement without Clean Speech

    Authors: Li-Wei Chen, Yao-Fei Cheng, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang

    Abstract: The lack of clean speech is a practical challenge to the development of speech enhancement systems, which means that there is an inevitable mismatch between their training criterion and evaluation metric. In response to this unfavorable situation, we propose a training and inference strategy that additionally uses enhanced speech as a target by improving the previously proposed noisy-target traini…

    Submitted 22 May, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: Accepted by Interspeech 2023

  46. arXiv:2210.13271  [pdf, other]

    eess.SP cs.LG

    ECG Artifact Removal from Single-Channel Surface EMG Using Fully Convolutional Networks

    Authors: Kuan-Chen Wang, Kai-Chun Liu, Sheng-Yu Peng, Yu Tsao

    Abstract: Electrocardiogram (ECG) artifact contamination often occurs in surface electromyography (sEMG) applications when the measured muscles are in proximity to the heart. Previous studies have developed and proposed various methods, such as high-pass filtering, template subtraction and so forth. However, these methods remain limited by the requirement of reference signals and distortion of original sEMG…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: 5 pages, 5 figures

  47. arXiv:2209.10446  [pdf, ps, other]

    eess.AS cs.SD eess.SP

    Mandarin Singing Voice Synthesis with Denoising Diffusion Probabilistic Wasserstein GAN

    Authors: Yin-** Cho, Yu Tsao, Hsin-Min Wang, Yi-Wen Liu

    Abstract: Singing voice synthesis (SVS) is the computer production of a human-like singing voice from given musical scores. To accomplish end-to-end SVS effectively and efficiently, this work adopts the acoustic model-neural vocoder architecture established for high-quality speech and singing voice synthesis. Specifically, this work aims to pursue a higher level of expressiveness in synthesized voices by co…

    Submitted 21 September, 2022; originally announced September 2022.

  48. arXiv:2207.09514  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

    Authors: Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

    Abstract: This paper presents recent progress on integrating speech separation and enhancement (SSE) into the ESPnet toolkit. Compared with the previous ESPnet-SE work, numerous features have been added, including recent state-of-the-art speech enhancement models with their respective training and evaluation recipes. Importantly, a new interface has been designed to flexibly combine speech enhancement front…

    Submitted 19 July, 2022; originally announced July 2022.

    Comments: To appear in Interspeech 2022

  49. arXiv:2206.09058  [pdf, other]

    eess.AS cs.LG

    NASTAR: Noise Adaptive Speech Enhancement with Target-Conditional Resampling

    Authors: Chi-Chang Lee, Cheng-Hung Hu, Yu-Chen Lin, Chu-Song Chen, Hsin-Min Wang, Yu Tsao

    Abstract: For deep learning-based speech enhancement (SE) systems, the training-test acoustic mismatch can cause notable performance degradation. To address the mismatch issue, numerous noise adaptation strategies have been derived. In this paper, we propose a novel method, called noise adaptive speech enhancement with target-conditional resampling (NASTAR), which reduces mismatches with only one sample (on…

    Submitted 17 June, 2022; originally announced June 2022.

    Comments: Accepted to Interspeech 2022

  50. arXiv:2206.07860  [pdf, other]

    cs.SD cs.LG eess.AS

    EPG2S: Speech Generation and Speech Enhancement based on Electropalatography and Audio Signals using Multimodal Learning

    Authors: Li-Chin Chen, Po-Hsun Chen, Richard Tzong-Han Tsai, Yu Tsao

    Abstract: Speech generation and enhancement based on articulatory movements facilitate communication when the scope of verbal communication is absent, e.g., in patients who have lost the ability to speak. Although various techniques have been proposed to this end, electropalatography (EPG), which is a monitoring technique that records contact between the tongue and hard palate during speech, has not been ad…

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: Accepted by IEEE Signal Processing Letters

    Journal ref: IEEE Signal Processing Letters, vol. 29, p. 2582-2586, 2022