-
DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding
Authors:
Suwon Shon,
Kwangyoun Kim,
Yi-Te Hsu,
Prashant Sridhar,
Shinji Watanabe,
Karen Livescu
Abstract:
The integration of pre-trained text-based large language models (LLM) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires the use of a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose the use of discrete speech units (DSU), rather than continuous-valued speech encoder outputs, that are converted to the LLM token embedding space using the speech adapter. We generate DSU using a self-supervised speech encoder followed by k-means clustering. The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering. We also explore various types of DSU extracted from different layers of the self-supervised speech encoder, as well as Mel frequency Cepstral Coefficients (MFCC). Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
Submitted 13 June, 2024;
originally announced June 2024.
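As a rough illustration of the "self-supervised speech encoder followed by k-means clustering" step in the abstract above, the sketch below maps feature frames to discrete unit indices by nearest-centroid lookup. The function name `assign_units`, the 2-D toy frames, and the two centroids are hypothetical; a real system would use encoder outputs and centroids learned by k-means, and this is not the authors' implementation.

```python
import math

def assign_units(frames, centroids):
    """Map each feature frame to the index of its nearest centroid (Euclidean)."""
    units = []
    for frame in frames:
        distances = [math.dist(frame, c) for c in centroids]
        units.append(distances.index(min(distances)))
    return units

# Hypothetical 2-D stand-ins for encoder outputs and pre-trained k-means centroids.
centroids = [(0.0, 0.0), (1.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (0.8, 0.7), (0.2, 0.1)]
units = assign_units(frames, centroids)
print(units)
```

The resulting integer sequence is what a speech adapter would then convert into the LLM token embedding space.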
-
A tunable binaural audio telepresence system capable of balancing immersive and enhanced modes
Authors:
Yicheng Hsu,
Mingsian R. Bai
Abstract:
Binaural Audio Telepresence (BAT) aims to encode the acoustic scene at the far end into binaural signals for the user at the near end. BAT encompasses an immense range of applications that can vary between two extreme modes of Immersive BAT (I-BAT) and Enhanced BAT (E-BAT). With I-BAT, our goal is to preserve the full ambience as if we were at the far end, while with E-BAT, our goal is to enhance the far-end conversation with significantly improved speech quality and intelligibility. To this end, this paper presents a tunable BAT system to vary between these two AT modes with a desired application-specific balance. Microphone signals are converted into binaural signals with prescribed ambience factor. A novel Spatial COherence REpresentation (SCORE) is proposed as an input feature for model training so that the network remains robust to different array setups. Experimental results demonstrated the superior performance of the proposed BAT, even when the array configurations were not included in the training phase.
Submitted 14 May, 2024;
originally announced May 2024.
-
Spatial-Temporal Activity-Informed Diarization and Separation
Authors:
Yicheng Hsu,
Ssuhan Chen,
Mingsian R. Bai
Abstract:
A robust multichannel speaker diarization and separation system is proposed by exploiting the spatio-temporal activity of the speakers. The system is realized in a hybrid architecture that combines the array signal processing units and the deep learning units. For speaker diarization, a spatial coherence matrix across time frames is computed based on the whitened relative transfer functions (wRTFs) of the microphone array. This serves as a robust feature for subsequent machine learning without the need for prior knowledge of the array configuration. A computationally efficient Spatial Activity-driven Speaker Diarization network (SASDnet) is constructed to estimate the speaker activity directly from the spatial coherence matrix. For speaker separation, we propose the Global and Local Activity-driven Speaker Extraction network (GLASEnet) to separate speaker signals via speaker-specific global and local spatial activity functions. The local spatial activity functions depend on the coherence between the wRTFs of each time-frequency bin and the target speaker-dominant bins. The global spatial activity functions are computed from the global spatial coherence functions based on frequency-averaged local spatial activity functions. Experimental results have demonstrated superior speaker diarization, counting, and separation performance achieved by the proposed system with low computational complexity compared to the pre-selected baselines.
Submitted 30 January, 2024;
originally announced January 2024.
-
Generative Context-aware Fine-tuning of Self-supervised Speech Models
Authors:
Suwon Shon,
Kwangyoun Kim,
Prashant Sridhar,
Yi-Te Hsu,
Shinji Watanabe,
Karen Livescu
Abstract:
When performing tasks like automatic speech recognition or spoken language understanding for a given utterance, access to preceding text or audio provides contextual information that can improve performance. Considering the recent advances in generative large language models (LLM), we hypothesize that an LLM could generate useful context information using the preceding text. With appropriate prompts, an LLM could generate a prediction of the next sentence or abstractive text like titles or topics. In this paper, we study the use of LLM-generated context information and propose an approach to distill the generated information during fine-tuning of self-supervised speech models, which we refer to as generative context-aware fine-tuning. This approach allows the fine-tuned model to make improved predictions without access to the true surrounding segments or to the LLM at inference time, while requiring only a very small additional context module. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis. The results show that generative context-aware fine-tuning outperforms a context injection fine-tuning approach that accesses the ground-truth previous text, and is competitive with a generative context injection fine-tuning approach that requires the LLM at inference time.
Submitted 15 December, 2023;
originally announced December 2023.
-
Learning-based Array Configuration-Independent Binaural Audio Telepresence with Scalable Signal Enhancement and Ambience Preservation
Authors:
Yicheng Hsu,
Mingsian R. Bai
Abstract:
Audio Telepresence (AT) aims to create an immersive experience of the audio scene at the far end for the user(s) at the near end. The application of AT could encompass scenarios with varying degrees of emphasis on signal enhancement and ambience preservation. It is desirable for an AT system to be scalable between these two extremes. To this end, we propose an array-based Binaural AT (BAT) system using the DeepFilterNet as the backbone to convert the array microphone signals into the Head-Related Transfer Function (HRTF)-filtered signals, with a tunable weighting between signal enhancement and ambience preservation. An array configuration-independent Spatial COherence REpresentation (SCORE) feature is proposed for the model training so that the network remains robust to different array geometries and sensor counts. Magnitude-weighted Interaural Phase Difference error (mw-IPDe), magnitude-weighted Interaural Level Difference error (mw-ILDe), and modified Scale-Invariant Signal-to-Distortion Ratio (mSI-SDR) are defined as performance metrics for objective evaluation. Subjective listening tests were also performed to validate the proposed BAT system. The results have shown that the proposed BAT system can achieve superior telepresence performance with the desired balance between signal enhancement and ambience preservation, even when the array configurations are unseen in the training phase.
Submitted 21 November, 2023;
originally announced November 2023.
-
Deep Beamforming for Speech Enhancement and Speaker Localization with an Array Response-Aware Loss Function
Authors:
Hsinyu Chang,
Yicheng Hsu,
Mingsian R. Bai
Abstract:
Recent research advances in deep neural network (DNN)-based beamformers have shown great promise for speech enhancement under adverse acoustic conditions. Different network architectures and input features have been explored in estimating beamforming weights. In this paper, we propose a deep beamformer based on an efficient convolutional recurrent network (CRN) trained with a novel ARray RespOnse-aWare (ARROW) loss function. The ARROW loss exploits the array responses of the target and interferer by using the ground truth relative transfer functions (RTFs). The DNN-based beamforming system, trained with ARROW loss through supervised learning, is able to perform speech enhancement and speaker localization jointly. Experimental results have shown that the proposed deep beamformer, trained with the linearly weighted scale-invariant source-to-noise ratio (SI-SNR) and ARROW loss functions, achieves superior performance in speech enhancement and speaker localization compared to two baselines.
Submitted 22 October, 2023; v1 submitted 19 October, 2023;
originally announced October 2023.
-
Safeguarding Learning-based Control for Smart Energy Systems with Sampling Specifications
Authors:
Chih-Hong Cheng,
Venkatesh Prasad Venkataramanan,
Pragya Kirti Gupta,
Yun-Fei Hsu,
Simon Burton
Abstract:
We study challenges using reinforcement learning in controlling energy systems, where apart from performance requirements, one has additional safety requirements such as avoiding blackouts. We detail how these safety requirements in real-time temporal logic can be strengthened via discretization into linear temporal logic (LTL), such that the satisfaction of the LTL formulae implies the satisfaction of the original safety requirements. The discretization enables advanced engineering methods such as synthesizing shields for safe reinforcement learning as well as formal verification, where for statistical model checking, the probabilistic guarantee acquired by LTL model checking forms a lower bound for the satisfaction of the original real-time safety requirements.
Submitted 11 August, 2023;
originally announced August 2023.
-
Array Configuration-Agnostic Personal Voice Activity Detection Based on Spatial Coherence
Authors:
Yicheng Hsu,
Mingsian R. Bai
Abstract:
Personal voice activity detection (PVAD) has received increased attention due to the growing popularity of personal mobile devices and smart speakers. PVAD is often an integral element to speech enhancement and recognition for these applications in which lightweight signal processing is only enabled for the target user. However, in real-world scenarios, the detection performance may degrade because of competing speakers, background noise, and reverberation. To address this problem, we propose to use equivalent rectangular bandwidth (ERB)-scaled spatial coherence as the input feature to train an array configuration-agnostic PVAD (ARCA-PVAD) network. Whereas the network model requires only 112k parameters, it exhibits excellent detection performance and robustness in adverse acoustic conditions. Notably, the proposed ARCA-PVAD system is scalable to array configurations. Experimental results have demonstrated the superior performance achieved by the proposed ARCA-PVAD system over a baseline in terms of the area under receiver operating characteristic curve and equal error rate.
Submitted 18 April, 2023;
originally announced April 2023.
-
Learning-based Robust Speaker Counting and Separation with the Aid of Spatial Coherence
Authors:
Yicheng Hsu,
Mingsian Bai
Abstract:
A three-stage approach is proposed for speaker counting and speech separation in noisy and reverberant environments. In the spatial feature extraction, a spatial coherence matrix (SCM) is computed using whitened relative transfer functions (wRTFs) across time frames. The global activity functions of each speaker are estimated from a simplex constructed using the eigenvectors of the SCM, while the local coherence functions are computed from the coherence between the wRTFs of a time-frequency bin and the global activity function-weighted RTF of the target speaker. In speaker counting, we use the eigenvalues of the SCM and the maximum similarity of the interframe global activity distributions between two speakers as the input features to the speaker counting network (SCnet). In speaker separation, a global and local activity-driven network (GLADnet) is used to extract each independent speaker signal, which is particularly useful for highly overlapping speech signals. Experimental results obtained from the real meeting recordings show that the proposed system achieves superior speaker counting and speaker separation performance compared to previous publications without the prior knowledge of the array configurations.
Submitted 7 August, 2023; v1 submitted 13 March, 2023;
originally announced March 2023.
-
Personalized Audio Quality Preference Prediction
Authors:
Chung-Che Wang,
Yu-Chun Lin,
Yu-Teng Hsu,
Jyh-Shing Roger Jang
Abstract:
This paper proposes to use both audio input and subject information to predict the personalized preference of two audio segments with the same content in different qualities. A siamese network is used to compare the inputs and predict the preference. Several different structures for each side of the siamese network are investigated, and an LDNet with PANNs' CNN6 as the encoder and a multi-layer perceptron block as the decoder outperforms a baseline model using only audio input the most, where the overall accuracy grows from 77.56% to 78.04%. Experimental results also show that using all the subject information, including age, gender, and the specifications of headphones or earphones, is more effective than using only a part of them.
Submitted 16 February, 2023;
originally announced February 2023.
-
Array Configuration-Agnostic Personalized Speech Enhancement using Long-Short-Term Spatial Coherence
Authors:
Yicheng Hsu,
Yonghan Lee,
Mingsian R. Bai
Abstract:
Personalized speech enhancement has been a field of active research for suppression of speechlike interferers such as competing speakers or TV dialogues. Compared with single channel approaches, multichannel PSE systems can be more effective in adverse acoustic conditions by leveraging the spatial information in microphone signals. However, the implementation of multichannel PSEs to accommodate a wide range of array topology in household applications can be challenging. To develop an array configuration agnostic PSE system, we define a spatial feature termed the long short term spatial coherence as the input feature to a convolutional recurrent network to monitor the voice activity of the target speaker. As another refinement, an equivalent rectangular bandwidth scaled LSTSC feature can be used to reduce the computational cost. Experiments were conducted to compare the proposed PSE systems, including the complete and the simplified versions with two baselines using unseen room responses and array configurations in the presence of TV noise and competing speakers. The results demonstrated that the proposed multichannel PSE network trained with the LSTSC feature achieved superior enhancement performance without precise knowledge of the array configurations and room responses.
Submitted 16 November, 2022;
originally announced November 2022.
-
Model-matching Principle Applied to the Design of an Array-based All-neural Binaural Rendering System for Audio Telepresence
Authors:
Yicheng Hsu,
Chenghumg Ma,
Mingsian R. Bai
Abstract:
Telepresence aims to create an immersive but virtual experience of the audio and visual scene at the far end for users at the near end. In this contribution, we propose an array-based binaural rendering system that converts the array microphone signals into the head-related transfer function (HRTF) filtered output signals for headphone-rendering. The proposed approach is formulated in light of a model-matching principle (MMP) and is capable of delivering more immersive experience than the conventional localization-beamforming-HRTF filtering (LBH) approach. The MMP-based rendering system can be realized via multichannel inverse filtering (MIF) and multichannel deep filtering (MDF). In this study, we adopted the MDF approach and used the LBH as well as MIF as the baselines. The all-neural system jointly captures the spatial information (spatial rendering), preserves ambient sound (enhancement), and reduces noise (enhancement) before generating binaural outputs. Objective and subjective tests are employed to compare the proposed telepresence system with two baselines.
Submitted 6 March, 2023; v1 submitted 20 October, 2022;
originally announced October 2022.
-
Multi-channel target speech enhancement based on ERB-scaled spatial coherence features
Authors:
Yicheng Hsu,
Yonghan Lee,
Mingsian R. Bai
Abstract:
Recently, speech enhancement technologies that are based on deep learning have received considerable research attention. If the spatial information in microphone signals is exploited, microphone arrays can be advantageous under some adverse acoustic conditions compared with single-microphone systems. However, multichannel speech enhancement is often performed in the short-time Fourier transform (STFT) domain, which renders the enhancement approach computationally expensive. To remedy this problem, we propose a novel equivalent rectangular bandwidth (ERB)-scaled spatial coherence feature that is dependent on the target speaker activity between two ERB bands. Experiments conducted using a four-microphone array in a reverberant environment, which involved speech interference, demonstrated the efficacy of the proposed system. This study also demonstrated that a network that was trained with the ERB-scaled spatial feature was robust against variations in the geometry and number of the microphones in the array.
Submitted 17 July, 2022;
originally announced July 2022.
-
Multi-channel end-to-end neural network for speech enhancement, source localization, and voice activity detection
Authors:
Yuan Chen,
Yicheng Hsu,
Mingsian R. Bai
Abstract:
Speech enhancement and source localization have been active research topics for several decades, with a wide range of real-world applications. Recently, the Deep Complex Convolution Recurrent network (DCCRN) has yielded impressive enhancement performance for single-channel systems. In this study, a neural beamformer consisting of a beamformer and a novel multi-channel DCCRN is proposed for speech enhancement and source localization. Complex-valued filters estimated by the multi-channel DCCRN serve as the weights of the beamformer. In addition, a one-stage learning-based procedure is employed for speech enhancement and source localization. The proposed network composed of the multi-channel DCCRN and the auxiliary network models the sound field, while minimizing the distortionless response loss function. Simulation results show that the proposed neural beamformer is effective in enhancing speech signals, with speech quality well preserved. The proposed neural beamformer also provides source localization and voice activity detection (VAD) functions.
Submitted 20 June, 2022;
originally announced June 2022.
-
Acoustic echo suppression using a learning-based multi-frame minimum variance distortionless response filter
Authors:
Yuefeng Tsai,
Yicheng Hsu,
Mingsian Bai
Abstract:
Distortion resulting from acoustic echo suppression (AES) is a common issue in full-duplex communication. To address the distortion problem, a multi-frame minimum variance distortionless response (MFMVDR) filtering technique is proposed. The MFMVDR filter with parameter estimation which was used in speech enhancement problems is extended in this study from a deep learning perspective. To alleviate numerical instability of the MFMVDR filter, we propose to directly estimate the inverse of the correlation matrix. The AES system is advantageous in that no double-talk detection is required. The negative scale-invariant signal-to-distortion ratio is employed as the loss function in training the network at the output of the MFMVDR filter. Simulation results have demonstrated the efficacy of the proposed learning-based AES system in double-talk, background noise, and nonlinear distortion conditions.
Submitted 7 May, 2022;
originally announced May 2022.
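The abstract above names the negative scale-invariant signal-to-distortion ratio (SI-SDR) as the training loss. The sketch below uses the commonly cited SI-SDR definition (project the estimate onto the reference, then compare target and residual energies); the function names and the small `eps` stabilizer are assumptions for illustration, and the paper's exact formulation may differ.

```python
import math

def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length sample sequences."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / (ref_energy + eps)           # optimal scaling of the reference
    target = [alpha * r for r in reference]    # component of estimate along reference
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))

def neg_si_sdr_loss(reference, estimate):
    """Loss to minimize: lower is better, so negate the SI-SDR."""
    return -si_sdr(reference, estimate)

# Hypothetical toy signals.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.1, 0.1, -0.9, 0.1]
print(neg_si_sdr_loss(reference, estimate))
```

Rescaling `estimate` by any nonzero constant leaves the value essentially unchanged, which is the property that makes the measure scale-invariant.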
-
Learning-based personal speech enhancement for teleconferencing by exploiting spatial-spectral features
Authors:
Yicheng Hsu,
Yonghan Lee,
Mingsian R. Bai
Abstract:
Teleconferencing is becoming essential during the COVID-19 pandemic. However, in real-world applications, speech quality can deteriorate due to, for example, background interference, noise, or reverberation. To solve this problem, target speech extraction from the mixture signals can be performed with the aid of the user's vocal features. Various features are accounted for in this study's proposed system, including speaker embeddings derived from user enrollment and a novel long-short-term spatial coherence feature pertaining to the target speaker activity. As a learning-based approach, a target speech sifting network was employed to extract the relevant features. The network trained with LSTSC in the proposed approach is robust to microphone array geometries and the number of microphones. Furthermore, the proposed enhancement system was compared with a baseline system with speaker embeddings and interchannel phase difference. The results demonstrated the superior performance of the proposed system over the baseline in enhancement performance and robustness.
Submitted 29 April, 2022; v1 submitted 10 December, 2021;
originally announced December 2021.
-
SEOFP-NET: Compression and Acceleration of Deep Neural Networks for Speech Enhancement Using Sign-Exponent-Only Floating-Points
Authors:
Yu-Chen Lin,
Cheng Yu,
Yi-Te Hsu,
Szu-Wei Fu,
Yu Tsao,
Tei-Wei Kuo
Abstract:
Numerous compression and acceleration strategies have achieved outstanding results on classification tasks in various fields, such as computer vision and speech signal processing. Nevertheless, the same strategies have yielded unsatisfactory performance on regression tasks because the nature of these tasks differs from that of classification tasks. In this paper, a novel sign-exponent-only floating-point network (SEOFP-NET) technique is proposed to compress the model size and accelerate the inference time for speech enhancement, a regression task of speech signal processing. The proposed method compresses the sizes of deep neural network (DNN)-based speech enhancement models by quantizing the fraction bits of single-precision floating-point parameters during training. Before inference implementation, all parameters in the trained SEOFP-NET model are slightly adjusted to accelerate the inference time by replacing the floating-point multiplier with an integer-adder. For generalization, the SEOFP-NET technique is introduced to different speech enhancement tasks in speech signal processing with different model architectures under various corpora. The experimental results indicate that the size of SEOFP-NET models can be significantly compressed by up to 81.249% without noticeably downgrading their speech enhancement performance, and the inference time can be accelerated to 1.212x compared with the baseline models. The results also verify that the proposed SEOFP-NET can cooperate with other efficiency strategies to achieve a synergy effect for model compression. In addition, the just noticeable difference (JND) was applied to the user study experiment to statistically analyze the effect of speech enhancement on listening. The results indicate that the listeners cannot easily differentiate between the enhanced speech signals processed by the baseline model and the proposed SEOFP-NET.
Submitted 8 November, 2021;
originally announced November 2021.
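To make the sign-exponent-only idea in the abstract above concrete, the sketch below zeroes the 23 fraction bits of an IEEE 754 single-precision value, leaving a signed power of two, and shows that multiplying two such values amounts to adding their exponent fields. Plain truncation and the helper names are assumptions for illustration; the paper's quantization during training and its parameter adjustment before inference are not reproduced here.

```python
import struct

SIGN_EXPONENT_MASK = 0xFF800000  # 1 sign bit + 8 exponent bits of a float32
EXPONENT_BIAS = 127

def float_to_bits(x):
    """Return the 32-bit pattern of x stored as an IEEE 754 single."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def to_sign_exponent_only(x):
    """Zero the fraction bits, so the result is a signed power of two."""
    bits = float_to_bits(x) & SIGN_EXPONENT_MASK
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def exponent_field(x):
    """Return the biased 8-bit exponent field of x."""
    return (float_to_bits(x) >> 23) & 0xFF

a = to_sign_exponent_only(3.0)    # 1.5 * 2**1   -> 2.0
b = to_sign_exponent_only(-0.3)   # -1.2 * 2**-2 -> -0.25
# For normal values, the product's exponent field is an integer sum minus the bias.
product_exponent = exponent_field(a) + exponent_field(b) - EXPONENT_BIAS
print(a, b, product_exponent == exponent_field(a * b))
```

The sketch ignores zeros, subnormals, infinities, and exponent overflow, which a real integer-adder implementation would have to handle.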
-
Generalized ODIN: Detecting Out-of-distribution Image without Learning from Out-of-distribution Data
Authors:
Yen-Chang Hsu,
Yilin Shen,
Hongxia Jin,
Zsolt Kira
Abstract:
Deep neural networks have attained remarkable performance when applied to data that comes from the same distribution as that of the training set, but can significantly degrade otherwise. Therefore, detecting whether an example is out-of-distribution (OoD) is crucial to enable a system that can reject such samples or alert users. Recent works have made significant progress on OoD benchmarks consisting of small image datasets. However, many recent methods based on neural networks rely on training or tuning with both in-distribution and out-of-distribution data. The latter is generally hard to define a-priori, and its selection can easily bias the learning. We base our work on a popular method ODIN, proposing two strategies for freeing it from the needs of tuning with OoD data, while improving its OoD detection performance. We specifically propose to decompose confidence scoring as well as a modified input pre-processing method. We show that both of these significantly help in detection performance. Our further analysis on a larger scale image dataset shows that the two types of distribution shifts, specifically semantic shift and non-semantic shift, present a significant difference in the difficulty of the problem, providing an analysis of when ODIN-like strategies do or do not work.
Submitted 31 March, 2020; v1 submitted 25 February, 2020;
originally announced February 2020.
-
J-Net: Randomly weighted U-Net for audio source separation
Authors:
Bo-Wen Chen,
Yen-Min Hsu,
Hung-Yi Lee
Abstract:
Several results in the computer vision literature have shown the potential of randomly weighted neural networks. While they perform fairly well as feature extractors for discriminative tasks, a positive correlation exists between their performance and that of their fully trained counterparts. Based on these findings, we pose two questions: what is the value of randomly weighted networks in difficult generative audio tasks such as audio source separation, and does such a positive correlation still exist for large random networks and their trained counterparts? In this paper, we demonstrate that the positive correlation still exists. Based on this discovery, we can try out different architecture designs or tricks without training the whole model. Meanwhile, we find the surprising result that, compared with leaving the encoder (down-sample path) of Wave-U-Net untrained, fixing the decoder (up-sample path) to random weights yields better performance, almost comparable to that of the fully trained model.
Submitted 28 November, 2019;
originally announced November 2019.
-
Early Findings from Field Trials of Heavy-Duty Truck Connected Eco-Driving System
Authors:
Ziran Wang,
Yuan-Pu Hsu,
Alexander Vu,
Francisco Caballero,
Peng Hao,
Guoyuan Wu,
Kanok Boriboonsomsin,
Matthew J. Barth,
Aravind Kailas,
Pascal Amar,
Eddie Garmon,
Sandeep Tanugula
Abstract:
In recent years, the development of connected and automated vehicle (CAV) technology has inspired numerous advanced applications targeted at improving existing transportation systems. As one of the widely studied applications of CAV technology, connected eco-driving takes advantage of Signal Phase and Timing (SPaT) information from traffic signals to enable CAVs to approach and depart from signalized intersections in an energy-efficient manner. However, the majority of connected eco-driving studies have been numerical or microscopic traffic simulations. Only a few studies have implemented the application on real vehicles, and even fewer have focused on heavy-duty trucks. In this study, we developed a connected eco-driving system and installed it on a heavy-duty diesel truck using cellular-based wireless communications. Field trials were conducted in the City of Carson, California, along two corridors with six connected signalized intersections capable of communicating their SPaT information. Early results showed the benefits of the system in smoothing the speed profiles of the equipped truck when approaching the connected signalized intersections.
Submitted 31 July, 2019;
originally announced August 2019.
-
Robustness against the channel effect in pathological voice detection
Authors:
Yi-Te Hsu,
Zining Zhu,
Chi-Te Wang,
Shih-Hau Fang,
Frank Rudzicz,
Yu Tsao
Abstract:
Many people are suffering from voice disorders, which can adversely affect the quality of their lives. In response, some researchers have proposed algorithms for automatic assessment of these disorders, based on voice signals. However, these signals can be sensitive to the recording devices. Indeed, the channel effect is a pervasive problem in machine learning for healthcare. In this study, we propose a detection system for pathological voice, which is robust against the channel effect. This system is based on a bidirectional LSTM network. To increase the performance robustness against channel mismatch, we integrate domain adversarial training (DAT) to eliminate the differences between the devices. When we train on data recorded on a high-quality microphone and evaluate on smartphone data without labels, our robust detection system increases the PR-AUC from 0.8448 to 0.9455 (and 0.9522 with target sample labels). To the best of our knowledge, this is the first study applying unsupervised domain adaptation to pathological voice detection. Notably, our system does not need target device sample labels, which allows for generalization to many new devices.
Submitted 2 December, 2018; v1 submitted 26 November, 2018;
originally announced November 2018.
-
A study on speech enhancement using exponent-only floating point quantized neural network (EOFP-QNN)
Authors:
Yi-Te Hsu,
Yu-Chen Lin,
Szu-Wei Fu,
Yu Tsao,
Tei-Wei Kuo
Abstract:
Numerous studies have investigated the effectiveness of neural network quantization on pattern classification tasks. The present study, for the first time, investigated the performance of speech enhancement (a regression task in speech processing) using a novel exponent-only floating-point quantized neural network (EOFP-QNN). The proposed EOFP-QNN consists of two stages: mantissa-quantization and exponent-quantization. In the mantissa-quantization stage, EOFP-QNN learns how to quantize the mantissa bits of the model parameters while preserving the regression accuracy using the least mantissa precision. In the exponent-quantization stage, the exponent part of the parameters is further quantized without causing any additional performance degradation. We evaluated the proposed EOFP quantization technique on two types of neural networks, namely, bidirectional long short-term memory (BLSTM) and fully convolutional neural network (FCN), on a speech enhancement task. Experimental results showed that the model sizes can be significantly reduced (the model sizes of the quantized BLSTM and FCN models were only 18.75% and 21.89%, respectively, compared to those of the original models) while maintaining satisfactory speech-enhancement performance.
Submitted 30 October, 2018; v1 submitted 17 August, 2018;
originally announced August 2018.
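The mantissa-quantization stage described in the abstract above looks for the least mantissa precision that preserves accuracy. The sketch below illustrates that idea only in a simplified, post-hoc form: it truncates float32 mantissas and searches for the smallest bit count that keeps the worst relative error of a set of values under a tolerance. The function names, the truncation rule, and the error criterion are all assumptions made for illustration; the paper learns the quantization during training, which is not reproduced here.

```python
import struct


def truncate_mantissa(x: float, kept_bits: int) -> float:
    """Keep only the top `kept_bits` of the 23 float32 mantissa bits."""
    if not 0 <= kept_bits <= 23:
        raise ValueError("kept_bits must be between 0 and 23")
    dropped = 23 - kept_bits
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Clear the lowest `dropped` bits; sign and exponent stay untouched.
    bits &= ~((1 << dropped) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def least_mantissa_bits(values, tolerance: float) -> int:
    """Smallest mantissa width whose worst relative error is <= tolerance.

    Zero values are skipped because relative error is undefined for them.
    Falls back to the full 23 bits if no smaller width is good enough.
    """
    for kept_bits in range(24):
        worst = max(
            (abs(truncate_mantissa(v, kept_bits) - v) / abs(v)
             for v in values if v != 0.0),
            default=0.0,
        )
        if worst <= tolerance:
            return kept_bits
    return 23


if __name__ == "__main__":
    weights = [0.75, -1.5, 0.15625]
    print(least_mantissa_bits(weights, tolerance=0.25))
```

In a real model the criterion would be the task metric (for example, enhancement quality) rather than per-weight relative error; the loop structure is the only part meant to carry over.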