Showing 1–14 of 14 results for author: Hung, K

  1. arXiv:2402.16321  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech

    Authors: Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao, Yu-Chiang Frank Wang

    Abstract: Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, which is time-consuming and expensive for label collection. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized-variatio…

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: Published as a conference paper at ICLR 2024

  2. arXiv:2307.04517  [pdf, other]

    eess.AS

    Study on the Correlation between Objective Evaluations and Subjective Speech Quality and Intelligibility

    Authors: Hsin-Tien Chiang, Kuo-Hsuan Hung, Szu-Wei Fu, Heng-Cheng Kuo, Ming-Hsueh Tsai, Yu Tsao

    Abstract: Subjective tests are the gold standard for evaluating speech quality and intelligibility; however, they are time-consuming and expensive. Thus, objective measures that align with human perceptions are crucial. This study evaluates the correlation between commonly used objective measures and subjective speech quality and intelligibility using a Chinese speech dataset. Moreover, new objective measur…

    Submitted 10 October, 2023; v1 submitted 10 July, 2023; originally announced July 2023.

  3. arXiv:2210.17456  [pdf, other]

    eess.AS cs.SD

    Audio-Visual Speech Enhancement and Separation by Utilizing Multi-Modal Self-Supervised Embeddings

    Authors: I-Chun Chern, Kuo-Hsuan Hung, Yi-Ting Chen, Tassadaq Hussain, Mandar Gogate, Amir Hussain, Yu Tsao, Jen-Cheng Hou

    Abstract: AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained via utilizing multi-modal self-supervised embeddings. Nevertheless, it is unclear if such representations can be generalized to solve real-world multi-moda…

    Submitted 31 May, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: ICASSP AMHAT 2023

  4. arXiv:2204.03339  [pdf, other]

    eess.AS

    Boosting Self-Supervised Embeddings for Speech Enhancement

    Authors: Kuo-Hsuan Hung, Szu-wei Fu, Huan-Hsin Tseng, Hsin-Tien Chiang, Yu Tsao, Chii-Wann Lin

    Abstract: Self-supervised learning (SSL) representation for speech has achieved state-of-the-art (SOTA) performance on several downstream tasks. However, there remains room for improvement in speech enhancement (SE) tasks. In this study, we used a cross-domain feature to solve the problem that SSL embeddings may lack fine-grained information to regenerate speech signals. By integrating the SSL representatio…

    Submitted 5 July, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: accepted to INTERSPEECH-2022

  5. arXiv:2202.06684  [pdf, other]

    eess.AS cs.LG cs.SD

    Partially Fake Audio Detection by Self-attention-based Fake Span Discovery

    Authors: Haibin Wu, Heng-Cheng Kuo, Naijun Zheng, Kuo-Hsuan Hung, Hung-Yi Lee, Yu Tsao, Hsin-Min Wang, Helen Meng

    Abstract: The past few years have witnessed the significant advances of speech synthesis and voice conversion technologies. However, such technologies can undermine the robustness of broadly implemented biometric identification models and can be harnessed by in-the-wild attackers for illegal uses. The ASVspoof challenge mainly focuses on synthesized audios by advanced speech synthesis and voice conversion m…

    Submitted 15 February, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

    Comments: Submitted to ICASSP 2022

  6. arXiv:2110.05866  [pdf]

    cs.SD cs.CL eess.AS

    MetricGAN-U: Unsupervised speech enhancement/ dereverberation based only on noisy/ reverberated speech

    Authors: Szu-Wei Fu, Cheng Yu, Kuo-Hsuan Hung, Mirco Ravanelli, Yu Tsao

    Abstract: Most of the deep learning-based speech enhancement models are learned in a supervised manner, which implies that pairs of noisy and clean speech are required during training. Consequently, several noisy speeches recorded in daily life cannot be used to train the model. Although certain unsupervised learning frameworks have also been proposed to solve the pair constraint, they still require clean s…

    Submitted 12 October, 2021; originally announced October 2021.

  7. arXiv:2106.05229  [pdf, other]

    cs.SD cs.LG eess.AS

    Speech Recovery for Real-World Self-powered Intermittent Devices

    Authors: Yu-Chen Lin, Tsun-An Hsieh, Kuo-Hsuan Hung, Cheng Yu, Harinath Garudadri, Yu Tsao, Tei-Wei Kuo

    Abstract: The incompleteness of speech inputs severely degrades the performance of all the related speech signal processing applications. Although many researches have been proposed to address this issue, they controlled the data missing conditions by simulation with self-defined masking lengths or sizes. Besides, the masking definitions are different among all these experimental settings. This paper presen…

    Submitted 24 January, 2022; v1 submitted 9 June, 2021; originally announced June 2021.

  8. arXiv:2102.03786  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    EMA2S: An End-to-End Multimodal Articulatory-to-Speech System

    Authors: Yu-Wen Chen, Kuo-Hsuan Hung, Shang-Yi Chuang, Jonathan Sherman, Wen-Chin Huang, Xugang Lu, Yu Tsao

    Abstract: Synthesized speech from articulatory movements can have real-world use for patients with vocal cord disorders, situations requiring silent speech, or in high-noise environments. In this work, we present EMA2S, an end-to-end multimodal articulatory-to-speech system that directly converts articulatory movements to speech signals. We use a neural-network-based vocoder combined with multimodal joint-t…

    Submitted 9 June, 2021; v1 submitted 7 February, 2021; originally announced February 2021.

  9. arXiv:2012.03426  [pdf]

    eess.SP cs.LG

    Deep Learning Based Signal Enhancement of Low-Resolution Accelerometer for Fall Detection Systems

    Authors: Kai-Chun Liu, Kuo-Hsuan Hung, Chia-Yeh Hsieh, Hsiang-Yun Huang, Chia-Tai Chan, Yu Tsao

    Abstract: In the last two decades, fall detection (FD) systems have been developed as a popular assistive technology. Such systems automatically detect critical fall events and immediately alert medical professionals or caregivers. To support long-term FD services, various power-saving strategies have been implemented. Among them, a reduced sampling rate is a common approach for an energy-efficient system i…

    Submitted 27 September, 2021; v1 submitted 6 December, 2020; originally announced December 2020.

    Comments: Accepted by IEEE Transactions on Cognitive and Developmental Systems, 12 pages, 7 figures, 8 tables

  10. arXiv:2011.01691  [pdf, other]

    eess.AS

    A Study of Incorporating Articulatory Movement Information in Speech Enhancement

    Authors: Yu-Wen Chen, Kuo-Hsuan Hung, Shang-Yi Chuang, Jonathan Sherman, Xugang Lu, Yu Tsao

    Abstract: Although deep learning algorithms are widely used for improving speech enhancement (SE) performance, the performance remains limited under highly challenging conditions, such as unseen noise or noise signals having low signal-to-noise ratios (SNRs). This study provides a pilot investigation on a novel multimodal audio-articulatory-movement SE (AAMSE) model to enhance SE performance under such chal…

    Submitted 9 June, 2021; v1 submitted 3 November, 2020; originally announced November 2020.

  11. arXiv:2008.09264  [pdf, other]

    eess.AS cs.LG cs.SD

    CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

    Authors: Yu-Wen Chen, Kuo-Hsuan Hung, You-Jin Li, Alexander Chao-Fu Kang, Ya-Hsin Lai, Kai-Chun Liu, Szu-Wei Fu, Syu-Siang Wang, Yu Tsao

    Abstract: This study presents a deep learning-based speech signal-processing mobile application known as CITISEN. The CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC), allowing CITISEN to be used as a platform for utilizing and evaluating SE models and flexibly extend the models to address various noise environments and users. For SE, a…

    Submitted 25 April, 2022; v1 submitted 20 August, 2020; originally announced August 2020.

  12. arXiv:2006.11139  [pdf, other]

    eess.AS

    Waveform-based Voice Activity Detection Exploiting Fully Convolutional networks with Multi-Branched Encoders

    Authors: Cheng Yu, Kuo-Hsuan Hung, I-Fan Lin, Szu-Wei Fu, Yu Tsao, Jeih-weih Hung

    Abstract: In this study, we propose an encoder-decoder structured system with fully convolutional networks to implement voice activity detection (VAD) directly on the time-domain waveform. The proposed system processes the input waveform to identify its segments to be either speech or non-speech. This novel waveform-based VAD algorithm, with a short-hand notation "WVAD", has two main particularities. First,…

    Submitted 19 June, 2020; originally announced June 2020.

  13. arXiv:2006.10296  [pdf]

    eess.AS cs.LG cs.SD

    Boosting Objective Scores of a Speech Enhancement Model by MetricGAN Post-processing

    Authors: Szu-Wei Fu, Chien-Feng Liao, Tsun-An Hsieh, Kuo-Hsuan Hung, Syu-Siang Wang, Cheng Yu, Heng-Cheng Kuo, Ryandhimas E. Zezario, You-Jin Li, Shang-Yi Chuang, Yen-Ju Lu, Yu Tsao

    Abstract: The Transformer architecture has demonstrated a superior ability compared to recurrent neural networks in many different natural language processing applications. Therefore, our study applies a modified Transformer in a speech enhancement task. Specifically, positional encoding in the Transformer may not be necessary for speech enhancement, and hence, it is replaced by convolutional layers. To fur…

    Submitted 3 March, 2021; v1 submitted 18 June, 2020; originally announced June 2020.

    Comments: Accepted by APSIPA 2020

  14. arXiv:1911.09847  [pdf, ps, other]

    eess.AS cs.SD eess.SP

    Time-Domain Multi-modal Bone/air Conducted Speech Enhancement

    Authors: Cheng Yu, Kuo-Hsuan Hung, Syu-Siang Wang, Szu-Wei Fu, Yu Tsao, Jeih-weih Hung

    Abstract: Previous studies have proven that integrating video signals, as a complementary modality, can facilitate improved performance for speech enhancement (SE). However, video clips usually contain large amounts of data and pose a high cost in terms of computational resources and thus may complicate the SE system. As an alternative source, a bone-conducted speech signal has a moderate data size while ma…

    Submitted 17 June, 2020; v1 submitted 21 November, 2019; originally announced November 2019.

    Comments: multi-modal, bone/air-conducted signals, speech enhancement, fully convolutional network

    Journal ref: IEEE Signal Processing Letters, vol. 27, pp. 1035-1039, 2020