Showing 1–22 of 22 results for author: Hsu, Y

Searching in archive eess.
  1. arXiv:2406.09345  [pdf, other]

    cs.CL cs.SD eess.AS

    DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding

    Authors: Suwon Shon, Kwangyoun Kim, Yi-Te Hsu, Prashant Sridhar, Shinji Watanabe, Karen Livescu

    Abstract: The integration of pre-trained text-based large language models (LLM) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires the use of a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose the use of discrete speech units (DSU), rather than continuous-valued speech encoder outputs, that are converted to t…

    Submitted 13 June, 2024; originally announced June 2024.

  2. arXiv:2405.08742  [pdf]

    eess.AS cs.SD

    A tunable binaural audio telepresence system capable of balancing immersive and enhanced modes

    Authors: Yicheng Hsu, Mingsian R. Bai

    Abstract: Binaural Audio Telepresence (BAT) aims to encode the acoustic scene at the far end into binaural signals for the user at the near end. BAT encompasses an immense range of applications that can vary between two extreme modes of Immersive BAT (I-BAT) and Enhanced BAT (E-BAT). With I-BAT, our goal is to preserve the full ambience as if we were at the far end, while with E-BAT, our goal is to enhance…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: 5 pages, 4 figures

  3. arXiv:2401.16850  [pdf]

    eess.AS cs.SD

    Spatial-Temporal Activity-Informed Diarization and Separation

    Authors: Yicheng Hsu, Ssuhan Chen, Mingsian R. Bai

    Abstract: A robust multichannel speaker diarization and separation system is proposed by exploiting the spatio-temporal activity of the speakers. The system is realized in a hybrid architecture that combines the array signal processing units and the deep learning units. For speaker diarization, a spatial coherence matrix across time frames is computed based on the whitened relative transfer functions (wRTFs…

    Submitted 30 January, 2024; originally announced January 2024.

    Comments: 13 pages

  4. arXiv:2312.09895  [pdf, other]

    cs.CL cs.SD eess.AS

    Generative Context-aware Fine-tuning of Self-supervised Speech Models

    Authors: Suwon Shon, Kwangyoun Kim, Prashant Sridhar, Yi-Te Hsu, Shinji Watanabe, Karen Livescu

    Abstract: When performing tasks like automatic speech recognition or spoken language understanding for a given utterance, access to preceding text or audio provides contextual information that can improve performance. Considering the recent advances in generative large language models (LLM), we hypothesize that an LLM could generate useful context information using the preceding text. With appropriate prompts, L…

    Submitted 15 December, 2023; originally announced December 2023.

  5. arXiv:2311.12706  [pdf]

    eess.AS

    Learning-based Array Configuration-Independent Binaural Audio Telepresence with Scalable Signal Enhancement and Ambience Preservation

    Authors: Yicheng Hsu, Mingsian R. Bai

    Abstract: Audio Telepresence (AT) aims to create an immersive experience of the audio scene at the far end for the user(s) at the near end. The application of AT could encompass scenarios with varying degrees of emphasis on signal enhancement and ambience preservation. It is desirable for an AT system to be scalable between these two extremes. To this end, we propose an array-based Binaural AT (BAT) system…

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: 10 pages, 11 figures

  6. arXiv:2310.12837  [pdf]

    eess.AS

    Deep Beamforming for Speech Enhancement and Speaker Localization with an Array Response-Aware Loss Function

    Authors: Hsinyu Chang, Yicheng Hsu, Mingsian R. Bai

    Abstract: Recent research advances in deep neural network (DNN)-based beamformers have shown great promise for speech enhancement under adverse acoustic conditions. Different network architectures and input features have been explored in estimating beamforming weights. In this paper, we propose a deep beamformer based on an efficient convolutional recurrent network (CRN) trained with a novel ARray RespOnse-…

    Submitted 22 October, 2023; v1 submitted 19 October, 2023; originally announced October 2023.

    Comments: 6 pages

  7. arXiv:2308.06069  [pdf, other]

    cs.SE cs.LG cs.LO eess.SY

    Safeguarding Learning-based Control for Smart Energy Systems with Sampling Specifications

    Authors: Chih-Hong Cheng, Venkatesh Prasad Venkataramanan, Pragya Kirti Gupta, Yun-Fei Hsu, Simon Burton

    Abstract: We study challenges using reinforcement learning in controlling energy systems, where apart from performance requirements, one has additional safety requirements such as avoiding blackouts. We detail how these safety requirements in real-time temporal logic can be strengthened via discretization into linear temporal logic (LTL), such that the satisfaction of the LTL formulae implies the satisfacti…

    Submitted 11 August, 2023; originally announced August 2023.

  8. arXiv:2304.08887  [pdf]

    eess.AS

    Array Configuration-Agnostic Personal Voice Activity Detection Based on Spatial Coherence

    Authors: Yicheng Hsu, Mingsian R. Bai

    Abstract: Personal voice activity detection has received increased attention due to the growing popularity of personal mobile devices and smart speakers. PVAD is often an integral element to speech enhancement and recognition for these applications in which lightweight signal processing is only enabled for the target user. However, in real-world scenarios, the detection performance may degrade because of co…

    Submitted 18 April, 2023; originally announced April 2023.

    Comments: Accepted by INTER-NOISE 2023. arXiv admin note: text overlap with arXiv:2211.08748

  9. arXiv:2303.06867  [pdf]

    eess.AS

    Learning-based Robust Speaker Counting and Separation with the Aid of Spatial Coherence

    Authors: Yicheng Hsu, Mingsian Bai

    Abstract: A three-stage approach is proposed for speaker counting and speech separation in noisy and reverberant environments. In the spatial feature extraction, a spatial coherence matrix (SCM) is computed using whitened relative transfer functions (wRTFs) across time frames. The global activity functions of each speaker are estimated from a simplex constructed using the eigenvectors of the SCM, while the…

    Submitted 7 August, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

    Comments: 20 pages, 17 figures

  10. arXiv:2302.08130  [pdf, other]

    cs.SD cs.LG eess.AS

    Personalized Audio Quality Preference Prediction

    Authors: Chung-Che Wang, Yu-Chun Lin, Yu-Teng Hsu, Jyh-Shing Roger Jang

    Abstract: This paper proposes to use both audio input and subject information to predict the personalized preference of two audio segments with the same content in different qualities. A siamese network is used to compare the inputs and predict the preference. Several different structures for each side of the siamese network are investigated, and an LDNet with PANNs' CNN6 as the encoder and a multi-layer pe…

    Submitted 16 February, 2023; originally announced February 2023.

  11. arXiv:2211.08748  [pdf]

    eess.AS cs.SD

    Array Configuration-Agnostic Personalized Speech Enhancement using Long-Short-Term Spatial Coherence

    Authors: Yicheng Hsu, Yonghan Lee, Mingsian R. Bai

    Abstract: Personalized speech enhancement has been a field of active research for suppression of speechlike interferers such as competing speakers or TV dialogues. Compared with single channel approaches, multichannel PSE systems can be more effective in adverse acoustic conditions by leveraging the spatial information in microphone signals. However, the implementation of multichannel PSEs to accommodate a…

    Submitted 16 November, 2022; originally announced November 2022.

  12. arXiv:2210.11123  [pdf]

    eess.AS cs.SD

    Model-matching Principle Applied to the Design of an Array-based All-neural Binaural Rendering System for Audio Telepresence

    Authors: Yicheng Hsu, Chenghumg Ma, Mingsian R. Bai

    Abstract: Telepresence aims to create an immersive but virtual experience of the audio and visual scene at the far end for users at the near end. In this contribution, we propose an array-based binaural rendering system that converts the array microphone signals into the head-related transfer function (HRTF) filtered output signals for headphone-rendering. The proposed approach is formulated in light of a m…

    Submitted 6 March, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

    Comments: accepted by ICASSP 2023

  13. arXiv:2207.08126  [pdf]

    eess.AS cs.SD

    Multi-channel target speech enhancement based on ERB-scaled spatial coherence features

    Authors: Yicheng Hsu, Yonghan Lee, Mingsian R. Bai

    Abstract: Recently, speech enhancement technologies that are based on deep learning have received considerable research attention. If the spatial information in microphone signals is exploited, microphone arrays can be advantageous under some adverse acoustic conditions compared with single-microphone systems. However, multichannel speech enhancement is often performed in the short-time Fourier transform (S…

    Submitted 17 July, 2022; originally announced July 2022.

    Comments: Accepted by International Congress on Acoustics (ICA) 2022. arXiv admin note: substantial text overlap with arXiv:2112.05686

  14. arXiv:2206.09728  [pdf]

    eess.AS

    Multi-channel end-to-end neural network for speech enhancement, source localization, and voice activity detection

    Authors: Yuan Chen, Yicheng Hsu, Mingsian R. Bai

    Abstract: Speech enhancement and source localization have been active research areas for several decades with a wide range of real-world applications. Recently, the Deep Complex Convolution Recurrent network (DCCRN) has yielded impressive enhancement performance for single-channel systems. In this study, a neural beamformer consisting of a beamformer and a novel multi-channel DCCRN is proposed for speech enhanceme…

    Submitted 20 June, 2022; originally announced June 2022.

    Comments: Accepted by ICA2022

  15. arXiv:2205.03594  [pdf]

    eess.AS cs.SD

    Acoustic echo suppression using a learning-based multi-frame minimum variance distortionless response filter

    Authors: Yuefeng Tsai, Yicheng Hsu, Mingsian Bai

    Abstract: Distortion resulting from acoustic echo suppression (AES) is a common issue in full-duplex communication. To address the distortion problem, a multi-frame minimum variance distortionless response (MFMVDR) filtering technique is proposed. The MFMVDR filter with parameter estimation which was used in speech enhancement problems is extended in this study from a deep learning perspective. To alleviate…

    Submitted 7 May, 2022; originally announced May 2022.

    Comments: Submitted to International Workshop on Acoustic Signal Enhancement (IWAENC) 2022

  16. Learning-based personal speech enhancement for teleconferencing by exploiting spatial-spectral features

    Authors: Yicheng Hsu, Yonghan Lee, Mingsian R. Bai

    Abstract: Teleconferencing is becoming essential during the COVID-19 pandemic. However, in real-world applications, speech quality can deteriorate due to, for example, background interference, noise, or reverberation. To solve this problem, target speech extraction from the mixture signals can be performed with the aid of the user's vocal features. Various features are accounted for in this study's proposed…

    Submitted 29 April, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

    Comments: accepted by ICASSP 2022

  17. arXiv:2111.04436  [pdf, other]

    cs.SD cs.LG eess.AS

    SEOFP-NET: Compression and Acceleration of Deep Neural Networks for Speech Enhancement Using Sign-Exponent-Only Floating-Points

    Authors: Yu-Chen Lin, Cheng Yu, Yi-Te Hsu, Szu-Wei Fu, Yu Tsao, Tei-Wei Kuo

    Abstract: Numerous compression and acceleration strategies have achieved outstanding results on classification tasks in various fields, such as computer vision and speech signal processing. Nevertheless, the same strategies have yielded ungratified performance on regression tasks because the nature between these and classification tasks differs. In this paper, a novel sign-exponent-only floating-point netwo…

    Submitted 8 November, 2021; originally announced November 2021.

  18. arXiv:2002.11297  [pdf, other]

    cs.CV cs.LG eess.IV

    Generalized ODIN: Detecting Out-of-distribution Image without Learning from Out-of-distribution Data

    Authors: Yen-Chang Hsu, Yilin Shen, Hongxia Jin, Zsolt Kira

    Abstract: Deep neural networks have attained remarkable performance when applied to data that comes from the same distribution as that of the training set, but can significantly degrade otherwise. Therefore, detecting whether an example is out-of-distribution (OoD) is crucial to enable a system that can reject such samples or alert users. Recent works have made significant progress on OoD benchmarks consist…

    Submitted 31 March, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

    Comments: CVPR 2020

  19. arXiv:1911.12926  [pdf, other]

    cs.SD cs.LG eess.AS

    J-Net: Randomly weighted U-Net for audio source separation

    Authors: Bo-Wen Chen, Yen-Min Hsu, Hung-Yi Lee

    Abstract: Several results in the computer vision literature have shown the potential of randomly weighted neural networks. While they perform fairly well as feature extractors for discriminative tasks, a positive correlation exists between their performance and their fully trained counterparts. According to these discoveries, we pose two questions: what is the value of randomly weighted networks in difficul…

    Submitted 28 November, 2019; originally announced November 2019.

  20. arXiv:1908.00541  [pdf, other]

    eess.SY

    Early Findings from Field Trials of Heavy-Duty Truck Connected Eco-Driving System

    Authors: Ziran Wang, Yuan-Pu Hsu, Alexander Vu, Francisco Caballero, Peng Hao, Guoyuan Wu, Kanok Boriboonsomsin, Matthew J. Barth, Aravind Kailas, Pascal Amar, Eddie Garmon, Sandeep Tanugula

    Abstract: In recent years, the development of connected and automated vehicle (CAV) technology has inspired numerous advanced applications targeted at improving existing transportation systems. As one of the widely studied applications of CAV technology, connected eco-driving takes advantage of Signal Phase and Timing (SPaT) information from traffic signals to enable CAVs to approach and depart from signali…

    Submitted 31 July, 2019; originally announced August 2019.

    Comments: Earlier incorrectly submitted as a replacement of arXiv:1902.07747 (which has been reverted). To appear in 2019 IEEE International Intelligent Transportation Systems Conference (ITSC)

  21. arXiv:1811.10376  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Robustness against the channel effect in pathological voice detection

    Authors: Yi-Te Hsu, Zining Zhu, Chi-Te Wang, Shih-Hau Fang, Frank Rudzicz, Yu Tsao

    Abstract: Many people are suffering from voice disorders, which can adversely affect the quality of their lives. In response, some researchers have proposed algorithms for automatic assessment of these disorders, based on voice signals. However, these signals can be sensitive to the recording devices. Indeed, the channel effect is a pervasive problem in machine learning for healthcare. In this study, we pro…

    Submitted 2 December, 2018; v1 submitted 26 November, 2018; originally announced November 2018.

    Comments: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:1811.07216

    Report number: ML4H/2018/200

  22. arXiv:1808.06474  [pdf, ps, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    A study on speech enhancement using exponent-only floating point quantized neural network (EOFP-QNN)

    Authors: Yi-Te Hsu, Yu-Chen Lin, Szu-Wei Fu, Yu Tsao, Tei-Wei Kuo

    Abstract: Numerous studies have investigated the effectiveness of neural network quantization on pattern classification tasks. The present study, for the first time, investigated the performance of speech enhancement (a regression task in speech processing) using a novel exponent-only floating-point quantized neural network (EOFP-QNN). The proposed EOFP-QNN consists of two stages: mantissa-quantization and…

    Submitted 30 October, 2018; v1 submitted 17 August, 2018; originally announced August 2018.