-
Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation
Authors:
Tsun-An Hsieh,
Heeyoul Choi,
Minje Kim
Abstract:
Recent studies highlight the potential of textual modalities in conditioning the speech separation model's inference process. However, regularization-based methods remain underexplored despite their advantages of not requiring auxiliary text data during the test time. To address this gap, we introduce a timed text-based regularization (TTR) method that uses language model-derived semantics to improve speech separation models. Our approach involves two steps. We begin with two pretrained audio and language models, WavLM and BERT, respectively. Then, a Transformer-based audio summarizer is learned to align the audio and word embeddings and to minimize their gap. The summarizer Transformer, incorporated as a regularizer, promotes the separated sources' alignment with the semantics from the timed text. Experimental results show that the proposed TTR method consistently improves the various objective metrics of the separation results over the unregularized baselines.
Submitted 12 June, 2024;
originally announced June 2024.
-
On the Importance of Neural Wiener Filter for Resource Efficient Multichannel Speech Enhancement
Authors:
Tsun-An Hsieh,
Jacob Donley,
Daniel Wong,
Buye Xu,
Ashutosh Pandey
Abstract:
We introduce a time-domain framework for efficient multichannel speech enhancement, emphasizing low latency and computational efficiency. This framework incorporates two compact deep neural networks (DNNs) surrounding a multichannel neural Wiener filter (NWF). The first DNN enhances the speech signal to estimate NWF coefficients, while the second DNN refines the output from the NWF. The NWF, while conceptually similar to the traditional frequency-domain Wiener filter, undergoes a training process optimized for low-latency speech enhancement, involving fine-tuning of both analysis and synthesis transforms. Our research results illustrate that the NWF output, having minimal nonlinear distortions, attains performance levels akin to those of the first DNN, deviating from conventional Wiener filter paradigms. Training all components jointly outperforms sequential training, despite its simplicity. Consequently, this framework achieves superior performance with fewer parameters and reduced computational demands, making it a compelling solution for resource-efficient multichannel speech enhancement.
Submitted 15 January, 2024;
originally announced January 2024.
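For orientation, the traditional frequency-domain Wiener filter that the abstract above contrasts with the NWF can be sketched as a per-bin gain S / (S + N), where S and N are speech and noise power estimates. This is the textbook single-channel form only, not the paper's multichannel, trained NWF; the function names and list-based inputs are hypothetical choices made for illustration.

```python
def wiener_gain(speech_psd, noise_psd):
    """Classical single-channel Wiener gain per frequency bin: S / (S + N).

    Illustrative only: the paper's NWF is multichannel and learned,
    which this textbook formula does not capture.
    """
    gains = []
    for s, n in zip(speech_psd, noise_psd):
        total = s + n
        # Guard against empty bins where both power estimates are zero.
        gains.append(s / total if total > 0 else 0.0)
    return gains


def apply_gain(noisy_spectrum, gains):
    """Scale each bin of the noisy spectrum by its Wiener gain."""
    return [g * x for g, x in zip(gains, noisy_spectrum)]
```

Bins dominated by speech keep a gain near 1, while bins dominated by noise are attenuated toward 0.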
-
TMSR: Tiny Multi-path CNNs for Super Resolution
Authors:
Chia-Hung Liu,
Tzu-Hsin Hsieh,
Kuan-Yu Huang,
Pei-Yin Chen
Abstract:
In this paper, we propose a tiny multi-path CNN-based super-resolution (SR) method called TMSR. We focus on tiny CNN-based SR methods with fewer than 5k parameters. The main contributions of the proposed method are improved multi-path learning and a self-defined activation function. The experimental results show that TMSR obtains competitive image quality (i.e., PSNR and SSIM) compared to related works with fewer than 5k parameters.
Submitted 4 December, 2023;
originally announced December 2023.
-
Inference and Denoise: Causal Inference-based Neural Speech Enhancement
Authors:
Tsun-An Hsieh,
Chao-Han Huck Yang,
Pin-Yu Chen,
Sabato Marco Siniscalchi,
Yu Tsao
Abstract:
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention. Based on the potential outcome framework, the proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE. Specifically, we use the presence of noise as guidance for EM selection during training, and the noise detector selects the enhancement module according to the prediction of the presence of noise for each frame. Moreover, we derive an SE-specific average treatment effect to quantify the causal effect adequately. Experimental evidence demonstrates that CISE outperforms a non-causal mask-based SE approach in the studied settings and has better performance and efficiency than more complex SE models.
Submitted 2 November, 2022;
originally announced November 2022.
-
Towards Automatic Transcription of Polyphonic Electric Guitar Music: A New Dataset and a Multi-Loss Transformer Model
Authors:
Yu-Hua Chen,
Wen-Yi Hsiao,
Tsu-Kuang Hsieh,
Jyh-Shing Roger Jang,
Yi-Hsuan Yang
Abstract:
In this paper, we propose a new dataset named EGDB, that contains transcriptions of the electric guitar performance of 240 tablatures rendered with different tones. Moreover, we benchmark the performance of two well-known transcription models proposed originally for the piano on this dataset, along with a multi-loss Transformer model that we newly propose. Our evaluation on this dataset and a separate set of real-world recordings demonstrate the influence of timbre on the accuracy of guitar sheet transcription, the potential of using multiple losses for Transformers, as well as the room for further improvement for this task.
Submitted 20 February, 2022;
originally announced February 2022.
-
Design of Sensor Fusion Driver Assistance System for Active Pedestrian Safety
Authors:
I-Hsi Kao,
Ya-Zhu Yian,
Jian-An Su,
Yi-Horng Lai,
Jau-Woei Perng,
Tung-Li Hsieh,
Yi-Shueh Tsai,
Min-Shiu Hsieh
Abstract:
In this paper, we present a parallel architecture for a sensor fusion detection system that combines a camera and a 1D light detection and ranging (lidar) sensor for object detection. The system contains two object detection methods, one based on optical flow and the other using lidar. The two sensors effectively compensate for each other's weaknesses. An accurate longitudinal location of the object and its lateral movement information can be obtained simultaneously. Using spatio-temporal alignment and a sensor fusion policy, we completed the development of a fusion detection system with high reliability at distances of up to 20 m. Test results show that the proposed system achieves a high level of accuracy for pedestrian or object detection in front of a vehicle, and is highly robust to special environments.
Submitted 23 January, 2022;
originally announced January 2022.
-
OSSEM: one-shot speaker adaptive speech enhancement using meta learning
Authors:
Cheng Yu,
Szu-Wei Fu,
Tsun-An Hsieh,
Yu Tsao,
Mirco Ravanelli
Abstract:
Although deep learning (DL) has achieved notable progress in speech enhancement (SE), further research is still required for a DL-based SE system to adapt effectively and efficiently to particular speakers. In this study, we propose a novel meta-learning-based speaker-adaptive SE approach (called OSSEM) that aims to achieve SE model adaptation in a one-shot manner. OSSEM consists of a modified transformer SE network and a speaker-specific masking (SSM) network. In practice, the SSM network takes an enrolled speaker embedding extracted using ECAPA-TDNN to adjust the input noisy feature through masking. To evaluate OSSEM, we designed a modified Voice Bank-DEMAND dataset, in which one utterance from the testing set was used for model adaptation, and the remaining utterances were used for testing the performance. Moreover, we set restrictions allowing the enhancement process to be conducted in real time, and thus designed OSSEM to be a causal SE system. Experimental results first show that OSSEM can effectively adapt a pretrained SE model to a particular speaker with only one utterance, thus yielding improved SE results. Meanwhile, OSSEM exhibits a competitive performance compared to state-of-the-art causal SE systems.
Submitted 10 November, 2021;
originally announced November 2021.
-
Speech Recovery for Real-World Self-powered Intermittent Devices
Authors:
Yu-Chen Lin,
Tsun-An Hsieh,
Kuo-Hsuan Hung,
Cheng Yu,
Harinath Garudadri,
Yu Tsao,
Tei-Wei Kuo
Abstract:
The incompleteness of speech inputs severely degrades the performance of all the related speech signal processing applications. Although many studies have addressed this issue, they controlled the missing-data conditions by simulation with self-defined masking lengths or sizes. Moreover, the masking definitions differ across these experimental settings. This paper presents a novel intermittent speech recovery (ISR) system for real-world self-powered intermittent devices. The ISR system applies three stages for speech reconstruction: interpolation, enhancement, and combination. The experimental results show that our recovery system increases speech quality by up to 591.7%, while increasing speech intelligibility by up to 80.5%. Most importantly, the proposed ISR system improves the WER scores by up to 52.6%. The promising results not only confirm the effectiveness of the reconstruction but also encourage the utilization of these battery-free wearable/IoT devices.
Submitted 24 January, 2022; v1 submitted 9 June, 2021;
originally announced June 2021.
-
MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement
Authors:
Szu-Wei Fu,
Cheng Yu,
Tsun-An Hsieh,
Peter Plantinga,
Mirco Ravanelli,
Xugang Lu,
Yu Tsao
Abstract:
The discrepancy between the cost function used for training a speech enhancement model and human auditory perception usually makes the quality of enhanced speech unsatisfactory. Objective evaluation metrics which consider human perception can hence serve as a bridge to reduce the gap. Our previously proposed MetricGAN was designed to optimize objective metrics by connecting the metric with a discriminator. Because only the scores of the target evaluation functions are needed during training, the metrics can even be non-differentiable. In this study, we propose a MetricGAN+ in which three training techniques incorporating domain-knowledge of speech processing are proposed. With these techniques, experimental results on the VoiceBank-DEMAND dataset show that MetricGAN+ can increase PESQ score by 0.3 compared to the previous MetricGAN and achieve state-of-the-art results (PESQ score = 3.15).
Submitted 4 June, 2021; v1 submitted 8 April, 2021;
originally announced April 2021.
-
Improving Perceptual Quality by Phone-Fortified Perceptual Loss using Wasserstein Distance for Speech Enhancement
Authors:
Tsun-An Hsieh,
Cheng Yu,
Szu-Wei Fu,
Xugang Lu,
Yu Tsao
Abstract:
Speech enhancement (SE) aims to improve speech quality and intelligibility, which are both related to a smooth transition in speech segments that may carry linguistic information, e.g. phones and syllables. In this study, we propose a novel phone-fortified perceptual loss (PFPL) that takes phonetic information into account for training SE models. To effectively incorporate the phonetic information, the PFPL is computed based on latent representations of the wav2vec model, a powerful self-supervised encoder that renders rich phonetic information. To more accurately measure the distribution distances of the latent representations, the PFPL adopts the Wasserstein distance as the distance measure. Our experimental results first reveal that the PFPL is more correlated with the perceptual evaluation metrics, as compared to signal-level losses. Moreover, the results showed that the PFPL can enable a deep complex U-Net SE model to achieve highly competitive performance in terms of standardized quality and intelligibility evaluations on the Voice Bank-DEMAND dataset.
Submitted 27 April, 2021; v1 submitted 28 October, 2020;
originally announced October 2020.
-
Boosting Objective Scores of a Speech Enhancement Model by MetricGAN Post-processing
Authors:
Szu-Wei Fu,
Chien-Feng Liao,
Tsun-An Hsieh,
Kuo-Hsuan Hung,
Syu-Siang Wang,
Cheng Yu,
Heng-Cheng Kuo,
Ryandhimas E. Zezario,
You-** Li,
Shang-Yi Chuang,
Yen-Ju Lu,
Yu Tsao
Abstract:
The Transformer architecture has demonstrated a superior ability compared to recurrent neural networks in many different natural language processing applications. Therefore, our study applies a modified Transformer in a speech enhancement task. Specifically, positional encoding in the Transformer may not be necessary for speech enhancement, and hence, it is replaced by convolutional layers. To further improve the perceptual evaluation of speech quality (PESQ) scores of enhanced speech, the L_1 pre-trained Transformer is fine-tuned using a MetricGAN framework. The proposed MetricGAN can be treated as a general post-processing module to further boost the objective scores of interest. The experiments were conducted using the data sets provided by the organizer of the Deep Noise Suppression (DNS) challenge. Experimental results demonstrated that the proposed system outperformed the challenge baseline, in both subjective and objective evaluations, by a large margin.
Submitted 3 March, 2021; v1 submitted 18 June, 2020;
originally announced June 2020.
-
WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement
Authors:
Tsun-An Hsieh,
Hsin-Min Wang,
Xugang Lu,
Yu Tsao
Abstract:
Due to the simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. In order to improve the performance of the E2E model, the locality and temporal sequential properties of speech should be efficiently taken into account when modelling. However, in most current E2E models for SE, these properties are either not fully considered or are too complex to be realized. In this paper, we propose an efficient E2E SE model, termed WaveCRN. In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU). Unlike a conventional temporal sequential model that uses a long short-term memory (LSTM) network, which is difficult to parallelize, SRU can be efficiently parallelized in calculation with even fewer model parameters. In addition, in order to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers; this is different from the approach that applies the estimated ratio mask on the noisy spectral features, which is commonly used in speech separation methods. Experimental results on speech denoising and compressed speech restoration tasks confirm that with the lightweight architecture of SRU and the feature-mapping-based RFM, WaveCRN performs comparably with other state-of-the-art approaches with notably reduced model complexity and inference time.
Submitted 26 November, 2020; v1 submitted 6 April, 2020;
originally announced April 2020.
-
Addressing the confounds of accompaniments in singer identification
Authors:
Tsung-Han Hsieh,
Kai-Hsiang Cheng,
Zhe-Cheng Fan,
Yu-Ching Yang,
Yi-Hsuan Yang
Abstract:
Identifying singers is an important task with many applications. However, the task remains challenging due to many issues. One major issue is related to the confounding factors from the background instrumental music that is mixed with the vocals in music production. A singer identification model may learn to extract non-vocal related features from the instrumental part of the songs if a singer only sings in certain musical contexts (e.g., genres). The model therefore cannot generalize well when the singer sings in unseen contexts. In this paper, we attempt to address this issue. Specifically, we employ open-unmix, an open source tool with state-of-the-art performance in source separation, to separate the vocal and instrumental tracks of music. We then investigate two means to train a singer identification model: by learning from the separated vocal only, or from an augmented set of data where we "shuffle-and-remix" the separated vocal tracks and instrumental tracks of different songs to artificially make the singers sing in different contexts. We also incorporate melodic features learned from the vocal melody contour for better performance. Evaluation results on a benchmark dataset called artist20 show that this data augmentation method greatly improves the accuracy of singer identification.
Submitted 17 February, 2020;
originally announced February 2020.
-
One-Shot Object Detection with Co-Attention and Co-Excitation
Authors:
Ting-I Hsieh,
Yi-Chen Lo,
Hwann-Tzong Chen,
Tyng-Luh Liu
Abstract:
This paper aims to tackle the challenging problem of one-shot object detection. Given a query image patch whose class label is not included in the training data, the goal of the task is to detect all instances of the same class in a target image. To this end, we develop a novel co-attention and co-excitation (CoAE) framework that makes contributions in three key technical aspects. First, we propose to use the non-local operation to explore the co-attention embodied in each query-target pair and yield region proposals accounting for the one-shot situation. Second, we formulate a squeeze-and-co-excitation scheme that can adaptively emphasize correlated feature channels to help uncover relevant proposals and eventually the target objects. Third, we design a margin-based ranking loss for implicitly learning a metric to predict the similarity of a region proposal to the underlying query, regardless of whether its class label is seen or unseen in training. The resulting model is therefore a two-stage detector that yields a strong baseline on both VOC and MS-COCO under the one-shot setting of detecting objects from both seen and never-seen classes. Code is available at https://github.com/timy90022/One-Shot-Object-Detection.
Submitted 28 November, 2019;
originally announced November 2019.
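The margin-based ranking idea mentioned in the abstract above can be illustrated with a generic pairwise hinge loss. This is a minimal sketch under assumptions, not the CoAE paper's actual formulation: the function name, the default margin of 0.3, and the all-pairs averaging are hypothetical choices.

```python
def margin_ranking_loss(pos_scores, neg_scores, margin=0.3):
    """Generic pairwise hinge ranking loss (illustrative, not the paper's loss).

    pos_scores: similarity scores of proposals that match the query.
    neg_scores: similarity scores of proposals that do not match.
    Each positive should outscore each negative by at least `margin`.
    """
    losses = [
        max(0.0, margin - (p - n))
        for p in pos_scores
        for n in neg_scores
    ]
    # Average over all positive/negative pairs; empty input gives zero loss.
    return sum(losses) / len(losses) if losses else 0.0
```

Pairs that are already separated by more than the margin contribute nothing, so training effort concentrates on proposals that are ranked too close to, or below, the non-matching ones.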
-
Adaptive Subspace Sampling for Class Imbalance Processing-Some clarifications, algorithm, and further investigation including applications to Brain Computer Interface
Authors:
Chin-Teng Lin,
Kuan-Chih Huang,
Yu-Ting Liu,
Yang-Yin Lin,
Tsung-Yu Hsieh,
Nikhil R. Pal,
Shang-Lin Wu,
Chieh-Ning Fang,
Zehong Cao
Abstract:
Kohonen's Adaptive Subspace Self-Organizing Map (ASSOM) learns several subspaces of the data where each subspace represents some invariant characteristics of the data. To deal with the imbalance classification problem, we earlier proposed a method for oversampling the minority class using Kohonen's ASSOM. This investigation extends that study, clarifies some issues related to our earlier work, provides the algorithm for generation of the oversamples, applies the method on several benchmark data sets, and applies it to three Brain Computer Interface (BCI) applications. First, we compare the performance of our method using some benchmark data sets with several state-of-the-art methods. Finally, we apply the ASSOM-based technique to analyze the three BCI-based applications using electroencephalogram (EEG) datasets. These tasks are classification of motor imagery, drivers' fatigue states, and phases of migraine. Our results demonstrate the effectiveness of the ASSOM-based method in dealing with the imbalance classification problem.
Submitted 7 October, 2020; v1 submitted 26 May, 2019;
originally announced June 2019.
-
A Streamlined Encoder/Decoder Architecture for Melody Extraction
Authors:
Tsung-Han Hsieh,
Li Su,
Yi-Hsuan Yang
Abstract:
Melody extraction in polyphonic musical audio is important for music signal processing. In this paper, we propose a novel streamlined encoder/decoder network that is designed for the task. We make two technical contributions. First, drawing inspiration from a state-of-the-art model for semantic pixel-wise segmentation, we pass through the pooling indices between pooling and un-pooling layers to localize the melody in frequency. We achieve results close to the state of the art with far fewer convolutional layers and simpler convolution modules. Second, we propose a way to use the bottleneck layer of the network to estimate the existence of a melody line for each time frame, which makes it possible to use a simple argmax function instead of ad-hoc thresholding to get the final estimation of the melody line. Our experiments on both vocal melody extraction and general melody extraction validate the effectiveness of the proposed model.
Submitted 18 February, 2019; v1 submitted 30 October, 2018;
originally announced October 2018.
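The argmax decoding described in the abstract above can be sketched as follows, assuming the network yields a per-frame salience over frequency bins plus one per-frame "no melody" score. The function name, the list-based inputs, and the use of `None` for non-melody frames are hypothetical; the paper's actual output layout may differ.

```python
def decode_melody(salience, no_melody_score):
    """Pick one frequency bin per frame, or None when 'no melody' wins.

    salience: list of frames, each a list of per-frequency-bin scores.
    no_melody_score: one score per frame for the absence of a melody
        (assumed here to come from the network's bottleneck layer).
    """
    decoded = []
    for frame, silence in zip(salience, no_melody_score):
        # Put the 'no melody' score in slot 0 so a single argmax decides
        # both voicing and pitch, with no hand-tuned threshold.
        candidates = [silence] + list(frame)
        best = max(range(len(candidates)), key=candidates.__getitem__)
        decoded.append(None if best == 0 else best - 1)
    return decoded
```

Because voicing and pitch compete in the same argmax, no separate threshold on the salience needs to be tuned.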