Search | arXiv e-print repository

On the Importance of Neural Wiener Filter for Resource Efficient Multichannel Speech Enhancement

Authors: Tsun-An Hsieh, Jacob Donley, Daniel Wong, Buye Xu, Ashutosh Pandey

Abstract: We introduce a time-domain framework for efficient multichannel speech enhancement, emphasizing low latency and computational efficiency. This framework incorporates two compact deep neural networks (DNNs) surrounding a multichannel neural Wiener filter (NWF). The first DNN enhances the speech signal to estimate NWF coefficients, while the second DNN refines the output from the NWF. The NWF, while… ▽ More We introduce a time-domain framework for efficient multichannel speech enhancement, emphasizing low latency and computational efficiency. This framework incorporates two compact deep neural networks (DNNs) surrounding a multichannel neural Wiener filter (NWF). The first DNN enhances the speech signal to estimate NWF coefficients, while the second DNN refines the output from the NWF. The NWF, while conceptually similar to the traditional frequency-domain Wiener filter, undergoes a training process optimized for low-latency speech enhancement, involving fine-tuning of both analysis and synthesis transforms. Our research results illustrate that the NWF output, having minimal nonlinear distortions, attains performance levels akin to those of the first DNN, deviating from conventional Wiener filter paradigms. Training all components jointly outperforms sequential training, despite its simplicity. Consequently, this framework achieves superior performance with fewer parameters and reduced computational demands, making it a compelling solution for resource-efficient multichannel speech enhancement. △ Less

Submitted 15 January, 2024; originally announced January 2024.

Comments: Accepted for publication at ICASSP

arXiv:2312.15320 [pdf]

GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal Machine Learning Combining Facial Images and Clinical Texts

Authors: Da Wu, **gye Yang, Cong Liu, Tzung-Chien Hsieh, Elaine Marchi, Justin Blair, Peter Krawitz, Chunhua Weng, Wendy Chung, Gholson J. Lyon, Ian D. Krantz, Jennifer M. Kalish, Kai Wang

Abstract: Individuals with suspected rare genetic disorders often undergo multiple clinical evaluations, imaging studies, laboratory tests and genetic tests, to find a possible answer over a prolonged period of time. Addressing this "diagnostic odyssey" thus has substantial clinical, psychosocial, and economic benefits. Many rare genetic diseases have distinctive facial features, which can be used by artifi… ▽ More Individuals with suspected rare genetic disorders often undergo multiple clinical evaluations, imaging studies, laboratory tests and genetic tests, to find a possible answer over a prolonged period of time. Addressing this "diagnostic odyssey" thus has substantial clinical, psychosocial, and economic benefits. Many rare genetic diseases have distinctive facial features, which can be used by artificial intelligence algorithms to facilitate clinical diagnosis, in prioritizing candidate diseases to be further examined by lab tests or genetic assays, or in hel** the phenotype-driven reinterpretation of genome/exome sequencing data. Existing methods using frontal facial photos were built on conventional Convolutional Neural Networks (CNNs), rely exclusively on facial images, and cannot capture non-facial phenotypic traits and demographic information essential for guiding accurate diagnoses. Here we introduce GestaltMML, a multimodal machine learning (MML) approach solely based on the Transformer architecture. It integrates facial images, demographic information (age, sex, ethnicity), and clinical notes (optionally, a list of Human Phenotype Ontology terms) to improve prediction accuracy. Furthermore, we also evaluated GestaltMML on a diverse range of datasets, including 528 diseases from the GestaltMatcher Database, several in-house datasets of Beckwith-Wiedemann syndrome (BWS, over-growth syndrome with distinct facial features), Sotos syndrome (overgrowth syndrome with overlap** features with BWS), NAA10-related neurodevelopmental syndrome, Cornelia de Lange syndrome (multiple malformation syndrome), and KBG syndrome (multiple malformation syndrome). Our results suggest that GestaltMML effectively incorporates multiple modalities of data, greatly narrowing candidate genetic diagnoses of rare diseases and may facilitate the reinterpretation of genome/exome sequencing data. △ Less

Submitted 21 April, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

Comments: Significant revisions

arXiv:2312.01644 [pdf]

TMSR: Tiny Multi-path CNNs for Super Resolution

Authors: Chia-Hung Liu, Tzu-Hsin Hsieh, Kuan-Yu Huang, Pei-Yin Chen

Abstract: In this paper, we proposed a tiny multi-path CNN-based Super-Resolution (SR) method, called TMSR. We mainly refer to some tiny CNN-based SR methods, under 5k parameters. The main contribution of the proposed method is the improved multi-path learning and self-defined activated function. The experimental results show that TMSR obtains competitive image quality (i.e. PSNR and SSIM) compared to the r… ▽ More In this paper, we proposed a tiny multi-path CNN-based Super-Resolution (SR) method, called TMSR. We mainly refer to some tiny CNN-based SR methods, under 5k parameters. The main contribution of the proposed method is the improved multi-path learning and self-defined activated function. The experimental results show that TMSR obtains competitive image quality (i.e. PSNR and SSIM) compared to the related works under 5k parameters. △ Less

Submitted 4 December, 2023; originally announced December 2023.

Comments: 5 pages, 7 figures, published in the IEEE Eurasia Conference on IoT, Communication and Engineering proceedings 2023

arXiv:2311.11783 [pdf]

CityScope: Enhanced Localozation and Synchronizing AR for Dynamic Urban Weather Visualization

Authors: Tzu Hsin Hsieh

Abstract: CityScope uses augmented reality (AR) to change our interaction with weather data. The main goal is to develop real-time 3D weather visualizations, with Taiwan as the model. It displays live weather data from the Central Weather Bureau (CWB), projected onto a physical representation of Taiwan's landscape. A pivotal advancement in our project is the integration of AprilTag with plane detection tech… ▽ More CityScope uses augmented reality (AR) to change our interaction with weather data. The main goal is to develop real-time 3D weather visualizations, with Taiwan as the model. It displays live weather data from the Central Weather Bureau (CWB), projected onto a physical representation of Taiwan's landscape. A pivotal advancement in our project is the integration of AprilTag with plane detection technology. This innovative combination significantly enhances the precision of the virtual visualizations within the physical world. By accurately aligning AR elements with real-world environments, CityScope achieves a seamless and realistic amalgamation of weather data and the physical terrain of Taiwan. This breakthrough in AR technology not only enhances the accuracy of weather visualizations but also enriches user experience, offering an immersive and interactive way to understand and engage with meteorological information. CityScope stands as a testament to the potential of AR in transforming data visualization and public engagement in meteorology. △ Less

Submitted 20 November, 2023; originally announced November 2023.

Comments: 9 pages, 15 figures

arXiv:2307.10490 [pdf, other]

Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs

Authors: Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, Vitaly Shmatikov

Abstract: We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and… ▽ More We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or make the subsequent dialog follow the attacker's instruction. We illustrate this attack with several proof-of-concept examples targeting LLaVa and PandaGPT. △ Less

Submitted 3 October, 2023; v1 submitted 19 July, 2023; originally announced July 2023.

arXiv:2305.02143 [pdf, other]

doi 10.1145/3641107

GANonymization: A GAN-based Face Anonymization Framework for Preserving Emotional Expressions

Authors: Fabio Hellmann, Silvan Mertes, Mohamed Benouis, Alexander Hustinx, Tzung-Chien Hsieh, Cristina Conati, Peter Krawitz, Elisabeth André

Abstract: In recent years, the increasing availability of personal data has raised concerns regarding privacy and security. One of the critical processes to address these concerns is data anonymization, which aims to protect individual privacy and prevent the release of sensitive information. This research focuses on the importance of face anonymization. Therefore, we introduce GANonymization, a novel face… ▽ More In recent years, the increasing availability of personal data has raised concerns regarding privacy and security. One of the critical processes to address these concerns is data anonymization, which aims to protect individual privacy and prevent the release of sensitive information. This research focuses on the importance of face anonymization. Therefore, we introduce GANonymization, a novel face anonymization framework with facial expression-preserving abilities. Our approach is based on a high-level representation of a face, which is synthesized into an anonymized version based on a generative adversarial network (GAN). The effectiveness of the approach was assessed by evaluating its performance in removing identifiable facial attributes to increase the anonymity of the given individual face. Additionally, the performance of preserving facial expressions was evaluated on several affect recognition datasets and outperformed the state-of-the-art methods in most categories. Finally, our approach was analyzed for its ability to remove various facial traits, such as jewelry, hair color, and multiple others. Here, it demonstrated reliable performance in removing these attributes. Our results suggest that GANonymization is a promising approach for anonymizing faces while preserving facial expressions. △ Less

Submitted 14 November, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

Comments: 26 pages, 11 figures, 6 tables, ACM Transactions on Multimedia Computing, Communications, and Applications

arXiv:2211.06764 [pdf, other]

doi 10.1109/WACV56688.2023.00499

Improving Deep Facial Phenoty** for Ultra-rare Disorder Verification Using Model Ensembles

Authors: Alexander Hustinx, Fabio Hellmann, Ömer Sümer, Behnam Javanmardi, Elisabeth André, Peter Krawitz, Tzung-Chien Hsieh

Abstract: Rare genetic disorders affect more than 6% of the global population. Reaching a diagnosis is challenging because rare disorders are very diverse. Many disorders have recognizable facial features that are hints for clinicians to diagnose patients. Previous work, such as GestaltMatcher, utilized representation vectors produced by a DCNN similar to AlexNet to match patients in high-dimensional featur… ▽ More Rare genetic disorders affect more than 6% of the global population. Reaching a diagnosis is challenging because rare disorders are very diverse. Many disorders have recognizable facial features that are hints for clinicians to diagnose patients. Previous work, such as GestaltMatcher, utilized representation vectors produced by a DCNN similar to AlexNet to match patients in high-dimensional feature space to support "unseen" ultra-rare disorders. However, the architecture and dataset used for transfer learning in GestaltMatcher have become outdated. Moreover, a way to train the model for generating better representation vectors for unseen ultra-rare disorders has not yet been studied. Because of the overall scarcity of patients with ultra-rare disorders, it is infeasible to directly train a model on them. Therefore, we first analyzed the influence of replacing GestaltMatcher DCNN with a state-of-the-art face recognition approach, iResNet with ArcFace. Additionally, we experimented with different face recognition datasets for transfer learning. Furthermore, we proposed test-time augmentation, and model ensembles that mix general face verification models and models specific for verifying disorders to improve the disorder verification accuracy of unseen ultra-rare disorders. Our proposed ensemble model achieves state-of-the-art performance on both seen and unseen disorders. △ Less

Submitted 12 November, 2022; originally announced November 2022.

Journal ref: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

arXiv:2211.01189 [pdf, other]

Inference and Denoise: Causal Inference-based Neural Speech Enhancement

Authors: Tsun-An Hsieh, Chao-Han Huck Yang, Pin-Yu Chen, Sabato Marco Siniscalchi, Yu Tsao

Abstract: This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention. Based on the potential outcome framework, the proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement module… ▽ More This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention. Based on the potential outcome framework, the proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE. Specifically, we use the presence of noise as guidance for EM selection during training, and the noise detector selects the enhancement module according to the prediction of the presence of noise for each frame. Moreover, we derived a SE-specific average treatment effect to quantify the causal effect adequately. Experimental evidence demonstrates that CISE outperforms a non-causal mask-based SE approach in the studied settings and has better performance and efficiency than more complex SE models. △ Less

Submitted 2 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP 2023

arXiv:2210.12705 [pdf, other]

doi 10.3233/SHTI230312

Few-Shot Meta Learning for Recognizing Facial Phenotypes of Genetic Disorders

Authors: Ömer Sümer, Fabio Hellmann, Alexander Hustinx, Tzung-Chien Hsieh, Elisabeth André, Peter Krawitz

Abstract: Computer vision-based methods have valuable use cases in precision medicine, and recognizing facial phenotypes of genetic disorders is one of them. Many genetic disorders are known to affect faces' visual appearance and geometry. Automated classification and similarity retrieval aid physicians in decision-making to diagnose possible genetic conditions as early as possible. Previous work has addres… ▽ More Computer vision-based methods have valuable use cases in precision medicine, and recognizing facial phenotypes of genetic disorders is one of them. Many genetic disorders are known to affect faces' visual appearance and geometry. Automated classification and similarity retrieval aid physicians in decision-making to diagnose possible genetic conditions as early as possible. Previous work has addressed the problem as a classification problem and used deep learning methods. The challenging issue in practice is the sparse label distribution and huge class imbalances across categories. Furthermore, most disorders have few labeled samples in training sets, making representation learning and generalization essential to acquiring a reliable feature descriptor. In this study, we used a facial recognition model trained on a large corpus of healthy individuals as a pre-task and transferred it to facial phenotype recognition. Furthermore, we created simple baselines of few-shot meta-learning methods to improve our base feature descriptor. Our quantitative results on GestaltMatcher Database show that our CNN baseline surpasses previous works, including GestaltMatcher, and few-shot meta-learning strategies improve retrieval performance in frequent and rare classes. △ Less

Submitted 24 May, 2023; v1 submitted 23 October, 2022; originally announced October 2022.

Comments: This paper is accepted for publication at MIE 2023 Conference

arXiv:2202.09907 [pdf, other]

towards automatic transcription of polyphonic electric guitar music:a new dataset and a multi-loss transformer model

Authors: Yu-Hua Chen, Wen-Yi Hsiao, Tsu-Kuang Hsieh, Jyh-Shing Roger Jang, Yi-Hsuan Yang

Abstract: In this paper, we propose a new dataset named EGDB, that con-tains transcriptions of the electric guitar performance of 240 tab-latures rendered with different tones. Moreover, we benchmark theperformance of two well-known transcription models proposed orig-inally for the piano on this dataset, along with a multi-loss Trans-former model that we newly propose. Our evaluation on this datasetand a se… ▽ More In this paper, we propose a new dataset named EGDB, that con-tains transcriptions of the electric guitar performance of 240 tab-latures rendered with different tones. Moreover, we benchmark theperformance of two well-known transcription models proposed orig-inally for the piano on this dataset, along with a multi-loss Trans-former model that we newly propose. Our evaluation on this datasetand a separate set of real-world recordings demonstrate the influenceof timbre on the accuracy of guitar sheet transcription, the potentialof using multiple losses for Transformers, as well as the room forfurther improvement for this task. △ Less

Submitted 20 February, 2022; originally announced February 2022.

Comments: to be published at ICASSP 2022

arXiv:2201.09208 [pdf]

Design of Sensor Fusion Driver Assistance System for Active Pedestrian Safety

Authors: I-Hsi Kao, Ya-Zhu Yian, Jian-An Su, Yi-Horng Lai, Jau-Woei Perng, Tung-Li Hsieh, Yi-Shueh Tsai, Min-Shiu Hsieh

Abstract: In this paper, we present a parallel architecture for a sensor fusion detection system that combines a camera and 1D light detection and ranging (lidar) sensor for object detection. The system contains two object detection methods, one based on an optical flow, and the other using lidar. The two sensors can effectively complement the defects of the other. The accurate longitudinal accuracy of the… ▽ More In this paper, we present a parallel architecture for a sensor fusion detection system that combines a camera and 1D light detection and ranging (lidar) sensor for object detection. The system contains two object detection methods, one based on an optical flow, and the other using lidar. The two sensors can effectively complement the defects of the other. The accurate longitudinal accuracy of the object's location and its lateral movement information can be achieved simultaneously. Using a spatio-temporal alignment and a policy of sensor fusion, we completed the development of a fusion detection system with high reliability at distances of up to 20 m. Test results show that the proposed system achieves a high level of accuracy for pedestrian or object detection in front of a vehicle, and has high robustness to special environments. △ Less

Submitted 23 January, 2022; originally announced January 2022.

Comments: The 14th International Conference on Automation Technology (Automation 2017), December 8-10, 2017, Kaohsiung, Taiwan

arXiv:2111.05703 [pdf, other]

OSSEM: one-shot speaker adaptive speech enhancement using meta learning

Authors: Cheng Yu, Szu-Wei Fu, Tsun-An Hsieh, Yu Tsao, Mirco Ravanelli

Abstract: Although deep learning (DL) has achieved notable progress in speech enhancement (SE), further research is still required for a DL-based SE system to adapt effectively and efficiently to particular speakers. In this study, we propose a novel meta-learning-based speaker-adaptive SE approach (called OSSEM) that aims to achieve SE model adaptation in a one-shot manner. OSSEM consists of a modified tra… ▽ More Although deep learning (DL) has achieved notable progress in speech enhancement (SE), further research is still required for a DL-based SE system to adapt effectively and efficiently to particular speakers. In this study, we propose a novel meta-learning-based speaker-adaptive SE approach (called OSSEM) that aims to achieve SE model adaptation in a one-shot manner. OSSEM consists of a modified transformer SE network and a speaker-specific masking (SSM) network. In practice, the SSM network takes an enrolled speaker embedding extracted using ECAPA-TDNN to adjust the input noisy feature through masking. To evaluate OSSEM, we designed a modified Voice Bank-DEMAND dataset, in which one utterance from the testing set was used for model adaptation, and the remaining utterances were used for testing the performance. Moreover, we set restrictions allowing the enhancement process to be conducted in real time, and thus designed OSSEM to be a causal SE system. Experimental results first show that OSSEM can effectively adapt a pretrained SE model to a particular speaker with only one utterance, thus yielding improved SE results. Meanwhile, OSSEM exhibits a competitive performance compared to state-of-the-art causal SE systems. △ Less

Submitted 10 November, 2021; originally announced November 2021.

arXiv:2106.05229 [pdf, other]

Speech Recovery for Real-World Self-powered Intermittent Devices

Authors: Yu-Chen Lin, Tsun-An Hsieh, Kuo-Hsuan Hung, Cheng Yu, Harinath Garudadri, Yu Tsao, Tei-Wei Kuo

Abstract: The incompleteness of speech inputs severely degrades the performance of all the related speech signal processing applications. Although many researches have been proposed to address this issue, they controlled the data missing conditions by simulation with self-defined masking lengths or sizes. Besides, the masking definitions are different among all these experimental settings. This paper presen… ▽ More The incompleteness of speech inputs severely degrades the performance of all the related speech signal processing applications. Although many researches have been proposed to address this issue, they controlled the data missing conditions by simulation with self-defined masking lengths or sizes. Besides, the masking definitions are different among all these experimental settings. This paper presents a novel intermittent speech recovery (ISR) system for real-world self-powered intermittent devices. Three contributive stages: interpolation, enhancement, and combination are applied to the ISR system for speech reconstruction. The experimental results show that our recovery system increases speech quality by up to 591.7%, while increasing speech intelligibility by up to 80.5%. Most importantly, the proposed ISR system improves the WER scores by up to 52.6%. The promising results not only confirm the effectiveness of the reconstruction but also encourage the utilization of these battery-free wearable/IoT devices. △ Less

Submitted 24 January, 2022; v1 submitted 9 June, 2021; originally announced June 2021.

arXiv:2104.06402 [pdf, other]

DropLoss for Long-Tail Instance Segmentation

Authors: Ting-I Hsieh, Esther Robb, Hwann-Tzong Chen, Jia-Bin Huang

Abstract: Long-tailed class distributions are prevalent among the practical applications of object detection and instance segmentation. Prior work in long-tail instance segmentation addresses the imbalance of losses between rare and frequent categories by reducing the penalty for a model incorrectly predicting a rare class label. We demonstrate that the rare categories are heavily suppressed by correct back… ▽ More Long-tailed class distributions are prevalent among the practical applications of object detection and instance segmentation. Prior work in long-tail instance segmentation addresses the imbalance of losses between rare and frequent categories by reducing the penalty for a model incorrectly predicting a rare class label. We demonstrate that the rare categories are heavily suppressed by correct background predictions, which reduces the probability for all foreground categories with equal weight. Due to the relative infrequency of rare categories, this leads to an imbalance that biases towards predicting more frequent categories. Based on this insight, we develop DropLoss -- a novel adaptive loss to compensate for this imbalance without a trade-off between rare and frequent categories. With this loss, we show state-of-the-art mAP across rare, common, and frequent categories on the LVIS dataset. △ Less

Submitted 17 April, 2021; v1 submitted 13 April, 2021; originally announced April 2021.

Comments: Code at https://github.com/timy90022/DropLoss

Journal ref: AAAI 2021

arXiv:2104.03538 [pdf]

MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement

Authors: Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, Yu Tsao

Abstract: The discrepancy between the cost function used for training a speech enhancement model and human auditory perception usually makes the quality of enhanced speech unsatisfactory. Objective evaluation metrics which consider human perception can hence serve as a bridge to reduce the gap. Our previously proposed MetricGAN was designed to optimize objective metrics by connecting the metric with a discr… ▽ More The discrepancy between the cost function used for training a speech enhancement model and human auditory perception usually makes the quality of enhanced speech unsatisfactory. Objective evaluation metrics which consider human perception can hence serve as a bridge to reduce the gap. Our previously proposed MetricGAN was designed to optimize objective metrics by connecting the metric with a discriminator. Because only the scores of the target evaluation functions are needed during training, the metrics can even be non-differentiable. In this study, we propose a MetricGAN+ in which three training techniques incorporating domain-knowledge of speech processing are proposed. With these techniques, experimental results on the VoiceBank-DEMAND dataset show that MetricGAN+ can increase PESQ score by 0.3 compared to the previous MetricGAN and achieve state-of-the-art results (PESQ score = 3.15). △ Less

Submitted 4 June, 2021; v1 submitted 8 April, 2021; originally announced April 2021.

Comments: Accepted by Interspeech 2021

arXiv:2011.11631 [pdf, ps, other]

Explainable Multivariate Time Series Classification: A Deep Neural Network Which Learns To Attend To Important Variables As Well As Informative Time Intervals

Authors: Tsung-Yu Hsieh, Suhang Wang, Yiwei Sun, Vasant Honavar

Abstract: Time series data is prevalent in a wide variety of real-world applications and it calls for trustworthy and explainable models for people to understand and fully trust decisions made by AI solutions. We consider the problem of building explainable classifiers from multi-variate time series data. A key criterion to understand such predictive models involves elucidating and quantifying the contribut… ▽ More Time series data is prevalent in a wide variety of real-world applications and it calls for trustworthy and explainable models for people to understand and fully trust decisions made by AI solutions. We consider the problem of building explainable classifiers from multi-variate time series data. A key criterion to understand such predictive models involves elucidating and quantifying the contribution of time varying input variables to the classification. Hence, we introduce a novel, modular, convolution-based feature extraction and attention mechanism that simultaneously identifies the variables as well as time intervals which determine the classifier output. We present results of extensive experiments with several benchmark data sets that show that the proposed method outperforms the state-of-the-art baseline methods on multi-variate time series classification task. The results of our case studies demonstrate that the variables and time intervals identified by the proposed method make sense relative to available domain knowledge. △ Less

Submitted 23 November, 2020; originally announced November 2020.

arXiv:2010.15174 [pdf, other]

Improving Perceptual Quality by Phone-Fortified Perceptual Loss using Wasserstein Distance for Speech Enhancement

Authors: Tsun-An Hsieh, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao

Abstract: Speech enhancement (SE) aims to improve speech quality and intelligibility, which are both related to a smooth transition in speech segments that may carry linguistic information, e.g. phones and syllables. In this study, we propose a novel phone-fortified perceptual loss (PFPL) that takes phonetic information into account for training SE models. To effectively incorporate the phonetic information… ▽ More Speech enhancement (SE) aims to improve speech quality and intelligibility, which are both related to a smooth transition in speech segments that may carry linguistic information, e.g. phones and syllables. In this study, we propose a novel phone-fortified perceptual loss (PFPL) that takes phonetic information into account for training SE models. To effectively incorporate the phonetic information, the PFPL is computed based on latent representations of the wav2vec model, a powerful self-supervised encoder that renders rich phonetic information. To more accurately measure the distribution distances of the latent representations, the PFPL adopts the Wasserstein distance as the distance measure. Our experimental results first reveal that the PFPL is more correlated with the perceptual evaluation metrics, as compared to signal-level losses. Moreover, the results showed that the PFPL can enable a deep complex U-Net SE model to achieve highly competitive performance in terms of standardized quality and intelligibility evaluations on the Voice Bank-DEMAND dataset. △ Less

Submitted 27 April, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

arXiv:2006.10296 [pdf]

Boosting Objective Scores of a Speech Enhancement Model by MetricGAN Post-processing

Authors: Szu-Wei Fu, Chien-Feng Liao, Tsun-An Hsieh, Kuo-Hsuan Hung, Syu-Siang Wang, Cheng Yu, Heng-Cheng Kuo, Ryandhimas E. Zezario, You-** Li, Shang-Yi Chuang, Yen-Ju Lu, Yu Tsao

Abstract: The Transformer architecture has demonstrated a superior ability compared to recurrent neural networks in many different natural language processing applications. Therefore, our study applies a modified Transformer in a speech enhancement task. Specifically, positional encoding in the Transformer may not be necessary for speech enhancement, and hence, it is replaced by convolutional layers. To fur… ▽ More The Transformer architecture has demonstrated a superior ability compared to recurrent neural networks in many different natural language processing applications. Therefore, our study applies a modified Transformer in a speech enhancement task. Specifically, positional encoding in the Transformer may not be necessary for speech enhancement, and hence, it is replaced by convolutional layers. To further improve the perceptual evaluation of the speech quality (PESQ) scores of enhanced speech, the L_1 pre-trained Transformer is fine-tuned using a MetricGAN framework. The proposed MetricGAN can be treated as a general post-processing module to further boost the objective scores of interest. The experiments were conducted using the data sets provided by the organizer of the Deep Noise Suppression (DNS) challenge. Experimental results demonstrated that the proposed system outperformed the challenge baseline, in both subjective and objective evaluations, with a large margin. △ Less

Submitted 3 March, 2021; v1 submitted 18 June, 2020; originally announced June 2020.

Comments: Accepted by APSIPA 2020

arXiv:2004.04098 [pdf, other]

doi 10.1109/LSP.2020.3040693

WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement

Authors: Tsun-An Hsieh, Hsin-Min Wang, Xugang Lu, Yu Tsao

Abstract: Due to the simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. In order to improve the performance of the E2E model, the locality and temporal sequential properties of speech should be efficiently taken into account when modelling. However, in most current E2E models for SE, these properties are either not fully considered or are too co… ▽ More Due to the simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. In order to improve the performance of the E2E model, the locality and temporal sequential properties of speech should be efficiently taken into account when modelling. However, in most current E2E models for SE, these properties are either not fully considered or are too complex to be realized. In this paper, we propose an efficient E2E SE model, termed WaveCRN. In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU). Unlike a conventional temporal sequential model that uses a long short-term memory (LSTM) network, which is difficult to parallelize, SRU can be efficiently parallelized in calculation with even fewer model parameters. In addition, in order to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers; this is different from the approach that applies the estimated ratio mask on the noisy spectral features, which is commonly used in speech separation methods. Experimental results on speech denoising and compressed speech restoration tasks confirm that with the lightweight architecture of SRU and the feature-map**-based RFM, WaveCRN performs comparably with other state-of-the-art approaches with notably reduced model complexity and inference time. △ Less

Submitted 26 November, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

arXiv:2002.06817 [pdf, other]

Addressing the confounds of accompaniments in singer identification

Authors: Tsung-Han Hsieh, Kai-Hsiang Cheng, Zhe-Cheng Fan, Yu-Ching Yang, Yi-Hsuan Yang

Abstract: Identifying singers is an important task with many applications. However, the task remains challenging due to many issues. One major issue is related to the confounding factors from the background instrumental music that is mixed with the vocals in music production. A singer identification model may learn to extract non-vocal related features from the instrumental part of the songs, if a singer on… ▽ More Identifying singers is an important task with many applications. However, the task remains challenging due to many issues. One major issue is related to the confounding factors from the background instrumental music that is mixed with the vocals in music production. A singer identification model may learn to extract non-vocal related features from the instrumental part of the songs, if a singer only sings in certain musical contexts (e.g., genres). The model cannot therefore generalize well when the singer sings in unseen contexts. In this paper, we attempt to address this issue. Specifically, we employ open-unmix, an open source tool with state-of-the-art performance in source separation, to separate the vocal and instrumental tracks of music. We then investigate two means to train a singer identification model: by learning from the separated vocal only, or from an augmented set of data where we "shuffle-and-remix" the separated vocal tracks and instrumental tracks of different songs to artificially make the singers sing in different contexts. We also incorporate melodic features learned from the vocal melody contour for better performance. Evaluation results on a benchmark dataset called the artist20 shows that this data augmentation method greatly improves the accuracy of singer identification. △ Less

Submitted 17 February, 2020; originally announced February 2020.

arXiv:1911.12529 [pdf, other]

One-Shot Object Detection with Co-Attention and Co-Excitation

Authors: Ting-I Hsieh, Yi-Chen Lo, Hwann-Tzong Chen, Tyng-Luh Liu

Abstract: This paper aims to tackle the challenging problem of one-shot object detection. Given a query image patch whose class label is not included in the training data, the goal of the task is to detect all instances of the same class in a target image. To this end, we develop a novel {\em co-attention and co-excitation} (CoAE) framework that makes contributions in three key technical aspects. First, we… ▽ More This paper aims to tackle the challenging problem of one-shot object detection. Given a query image patch whose class label is not included in the training data, the goal of the task is to detect all instances of the same class in a target image. To this end, we develop a novel {\em co-attention and co-excitation} (CoAE) framework that makes contributions in three key technical aspects. First, we propose to use the non-local operation to explore the co-attention embodied in each query-target pair and yield region proposals accounting for the one-shot situation. Second, we formulate a squeeze-and-co-excitation scheme that can adaptively emphasize correlated feature channels to help uncover relevant proposals and eventually the target objects. Third, we design a margin-based ranking loss for implicitly learning a metric to predict the similarity of a region proposal to the underlying query, no matter its class label is seen or unseen in training. The resulting model is therefore a two-stage detector that yields a strong baseline on both VOC and MS-COCO under one-shot setting of detecting objects from both seen and never-seen classes. Codes are available at https://github.com/timy90022/One-Shot-Object-Detection. △ Less

Submitted 28 November, 2019; originally announced November 2019.

Comments: NeurIPS 2019

arXiv:1909.06543 [pdf, other]

Node Injection Attacks on Graphs via Reinforcement Learning

Authors: Yiwei Sun, Suhang Wang, Xianfeng Tang, Tsung-Yu Hsieh, Vasant Honavar

Abstract: Real-world graph applications, such as advertisements and product recommendations make profits based on accurately classify the label of the nodes. However, in such scenarios, there are high incentives for the adversaries to attack such graph to reduce the node classification performance. Previous work on graph adversarial attacks focus on modifying existing graph structures, which is infeasible i… ▽ More Real-world graph applications, such as advertisements and product recommendations make profits based on accurately classify the label of the nodes. However, in such scenarios, there are high incentives for the adversaries to attack such graph to reduce the node classification performance. Previous work on graph adversarial attacks focus on modifying existing graph structures, which is infeasible in most real-world applications. In contrast, it is more practical to inject adversarial nodes into existing graphs, which can also potentially reduce the performance of the classifier. In this paper, we study the novel node injection poisoning attacks problem which aims to poison the graph. We describe a reinforcement learning based method, namely NIPA, to sequentially modify the adversarial information of the injected nodes. We report the results of experiments using several benchmark data sets that show the superior performance of the proposed method NIPA, relative to the existing state-of-the-art methods. △ Less

Submitted 14 September, 2019; originally announced September 2019.

Comments: Preprint, under review

arXiv:1909.01084 [pdf, other]

MEGAN: A Generative Adversarial Network for Multi-View Network Embedding

Authors: Yiwei Sun, Suhang Wang, Tsung-Yu Hsieh, Xianfeng Tang, Vasant Honavar

Abstract: Data from many real-world applications can be naturally represented by multi-view networks where the different views encode different types of relationships (e.g., friendship, shared interests in music, etc.) between real-world individuals or entities. There is an urgent need for methods to obtain low-dimensional, information preserving and typically nonlinear embeddings of such multi-view network… ▽ More Data from many real-world applications can be naturally represented by multi-view networks where the different views encode different types of relationships (e.g., friendship, shared interests in music, etc.) between real-world individuals or entities. There is an urgent need for methods to obtain low-dimensional, information preserving and typically nonlinear embeddings of such multi-view networks. However, most of the work on multi-view learning focuses on data that lack a network structure, and most of the work on network embeddings has focused primarily on single-view networks. Against this background, we consider the multi-view network representation learning problem, i.e., the problem of constructing low-dimensional information preserving embeddings of multi-view networks. Specifically, we investigate a novel Generative Adversarial Network (GAN) framework for Multi-View Network Embedding, namely MEGAN, aimed at preserving the information from the individual network views, while accounting for connectivity across (and hence complementarity of and correlations between) different views. The results of our experiments on two real-world multi-view data sets show that the embeddings obtained using MEGAN outperform the state-of-the-art methods on node classification, link prediction and visualization tasks. △ Less

Submitted 19 August, 2019; originally announced September 2019.

Comments: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19

arXiv:1906.02772 [pdf]

Adaptive Subspace Sampling for Class Imbalance Processing-Some clarifications, algorithm, and further investigation including applications to Brain Computer Interface

Authors: Chin-Teng Lin, Kuan-Chih Huang, Yu-Ting Liu, Yang-Yin Lin, Tsung-Yu Hsieh, Nikhil R. Pal, Shang-Lin Wu, Chieh-Ning Fang, Zehong Cao

Abstract: Kohonen's Adaptive Subspace Self-Organizing Map (ASSOM) learns several subspaces of the data where each subspace represents some invariant characteristics of the data. To deal with the imbalance classification problem, earlier we have proposed a method for oversampling the minority class using Kohonen's ASSOM. This investigation extends that study, clarifies some issues related to our earlier work… ▽ More Kohonen's Adaptive Subspace Self-Organizing Map (ASSOM) learns several subspaces of the data where each subspace represents some invariant characteristics of the data. To deal with the imbalance classification problem, earlier we have proposed a method for oversampling the minority class using Kohonen's ASSOM. This investigation extends that study, clarifies some issues related to our earlier work, provides the algorithm for generation of the oversamples, applies the method on several benchmark data sets, and makes application to three Brain Computer Interface (BCI) applications. First we compare the performance of our method using some benchmark data sets with several state-of-the-art methods. Finally, we apply the ASSOM-based technique to analyze the three BCI based applications using electroencephalogram (EEG) datasets. These tasks are classification of motor imagery , drivers' fatigue states, and phases of migraine. Our results demonstrate the effectiveness of the ASSOM-based meth od in dealing with imbalance classification problem. △ Less

Submitted 7 October, 2020; v1 submitted 26 May, 2019; originally announced June 2019.

Comments: The current version is accepted by iFuzzy 2020

arXiv:1811.02616 [pdf, ps, other]

Multi-View Network Embedding Via Graph Factorization Clustering and Co-Regularized Multi-View Agreement

Authors: Yiwei Sun, Ngot Bui, Tsung-Yu Hsieh, Vasant Honavar

Abstract: Real-world social networks and digital platforms are comprised of individuals (nodes) that are linked to other individuals or entities through multiple types of relationships (links). Sub-networks of such a network based on each type of link correspond to distinct views of the underlying network. In real-world applications, each node is typically linked to only a small subset of other nodes. Hence… ▽ More Real-world social networks and digital platforms are comprised of individuals (nodes) that are linked to other individuals or entities through multiple types of relationships (links). Sub-networks of such a network based on each type of link correspond to distinct views of the underlying network. In real-world applications, each node is typically linked to only a small subset of other nodes. Hence, practical approaches to problems such as node labeling have to cope with the resulting sparse networks. While low-dimensional network embeddings offer a promising approach to this problem, most of the current network embedding methods focus primarily on single view networks. We introduce a novel multi-view network embedding (MVNE) algorithm for constructing low-dimensional node embeddings from multi-view networks. MVNE adapts and extends an approach to single view network embedding (SVNE) using graph factorization clustering (GFC) to the multi-view setting using an objective function that maximizes the agreement between views based on both the local and global structure of the underlying multi-view graph. Our experiments with several benchmark real-world single view networks show that GFC-based SVNE yields network embeddings that are competitive with or superior to those produced by the state-of-the-art single view network embedding methods when the embeddings are used for labeling unlabeled nodes in the networks. Our experiments with several multi-view networks show that MVNE substantially outperforms the single view methods on integrated view and the state-of-the-art multi-view methods. We further show that even when the goal is to predict labels of nodes within a single target view, MVNE outperforms its single-view counterpart suggesting that the MVNE is able to extract the information that is useful for labeling nodes in the target view from the all of the views. △ Less

Submitted 18 February, 2019; v1 submitted 6 November, 2018; originally announced November 2018.

Comments: ICDMW2018 -- IEEE International Conference on Data Mining workshop on Graph Analytics

arXiv:1810.12947 [pdf, other]

A Streamlined Encoder/Decoder Architecture for Melody Extraction

Authors: Tsung-Han Hsieh, Li Su, Yi-Hsuan Yang

Abstract: Melody extraction in polyphonic musical audio is important for music signal processing. In this paper, we propose a novel streamlined encoder/decoder network that is designed for the task. We make two technical contributions. First, drawing inspiration from a state-of-the-art model for semantic pixel-wise segmentation, we pass through the pooling indices between pooling and un-pooling layers to lo… ▽ More Melody extraction in polyphonic musical audio is important for music signal processing. In this paper, we propose a novel streamlined encoder/decoder network that is designed for the task. We make two technical contributions. First, drawing inspiration from a state-of-the-art model for semantic pixel-wise segmentation, we pass through the pooling indices between pooling and un-pooling layers to localize the melody in frequency. We can achieve result close to the state-of-the-art with much fewer convolutional layers and simpler convolution modules. Second, we propose a way to use the bottleneck layer of the network to estimate the existence of a melody line for each time frame, and make it possible to use a simple argmax function instead of ad-hoc thresholding to get the final estimation of the melody line. Our experiments on both vocal melody extraction and general melody extraction validate the effectiveness of the proposed model. △ Less

Submitted 18 February, 2019; v1 submitted 30 October, 2018; originally announced October 2018.

Comments: This is a pre-print version of an ICASSP 2019 paper

arXiv:1809.01225 [pdf, ps, other]

Compositional Stochastic Average Gradient for Machine Learning and Related Applications

Authors: Tsung-Yu Hsieh, Yasser EL-Manzalawy, Yiwei Sun, Vasant Honavar

Abstract: Many machine learning, statistical inference, and portfolio optimization problems require minimization of a composition of expected value functions (CEVF). Of particular interest is the finite-sum versions of such compositional optimization problems (FS-CEVF). Compositional stochastic variance reduced gradient (C-SVRG) methods that combine stochastic compositional gradient descent (SCGD) and stoch… ▽ More Many machine learning, statistical inference, and portfolio optimization problems require minimization of a composition of expected value functions (CEVF). Of particular interest is the finite-sum versions of such compositional optimization problems (FS-CEVF). Compositional stochastic variance reduced gradient (C-SVRG) methods that combine stochastic compositional gradient descent (SCGD) and stochastic variance reduced gradient descent (SVRG) methods are the state-of-the-art methods for FS-CEVF problems. We introduce compositional stochastic average gradient descent (C-SAG) a novel extension of the stochastic average gradient method (SAG) to minimize composition of finite-sum functions. C-SAG, like SAG, estimates gradient by incorporating memory of previous gradient information. We present theoretical analyses of C-SAG which show that C-SAG, like SAG, and C-SVRG, achieves a linear convergence rate when the objective function is strongly convex; However, C-CAG achieves lower oracle query complexity per iteration than C-SVRG. Finally, we present results of experiments showing that C-SAG converges substantially faster than full gradient (FG), as well as C-SVRG. △ Less

Submitted 7 September, 2018; v1 submitted 4 September, 2018; originally announced September 2018.

arXiv:1710.06842 [pdf]

Measuring the unmeasurable - a project of domestic violence risk prediction and management

Authors: Ya-Yun Chen, Chia-Kai Liu, Yu-Hsiu Wang, Sue-Chuan Chen, Yi-Shan Hsieh, **g-Tai Ke, T. C. Hsieh

Abstract: The prevention of domestic violence (DV) have aroused serious concerns in Taiwan because of the disparity between the increasing amount of reported DV cases that doubled over the past decade and the scarcity of social workers. Additionally, a large amount of data was collected when social workers use the predominant case management approach to document case reports information. However, these data… ▽ More The prevention of domestic violence (DV) have aroused serious concerns in Taiwan because of the disparity between the increasing amount of reported DV cases that doubled over the past decade and the scarcity of social workers. Additionally, a large amount of data was collected when social workers use the predominant case management approach to document case reports information. However, these data were not properly stored or organized. To improve the efficiency of DV prevention and risk management, we worked with Taipei City Government and utilized the 2015 data from its DV database to perform a spatial pattern analysis of the reports of DV cases to build a DV risk map. However, during our map building process, the issue of confounding bias arose because we were not able to verify if reported cases truly reflected real violence occurrence or were simply false reports from potential victim's neighbors. Therefore, we used the random forest method to build a repeat victimization risk prediction model. The accuracy and F1-measure of our model were 96.3% and 62.8%. This model helped social workers differentiate the risk level of new cases, which further reduced their major workload significantly. To our knowledge, this is the first project that utilized machine learning in DV prevention. The research approach and results of this project not only can improve DV prevention process, but also be applied to other social work or criminal prevention areas. △ Less

Submitted 18 October, 2017; originally announced October 2017.

Comments: Presented at the Data For Good Exchange 2017

arXiv:0804.4749 [pdf]

Study of improving nano-contouring performance by employing cross-coupling controller

Authors: Wen Yuh Jywe, Shih Shin Chen, Hung-Shu Wang, Chien Hung Liu, Hsin Hung Jwo, Yun Feng Teng, Tung Hsien Hsieh

Abstract: For the tracking stage path planning, we design a two-axis cross-coupling control system which uses the PI controller to compensate the contour error between axes. In this paper, the stage adoptive is designed by our laboratory (Precision Machine Center of National Formosa University). The cross-coupling controller calculates the actuating signal of each axis by combining multi-axes position err… ▽ More For the tracking stage path planning, we design a two-axis cross-coupling control system which uses the PI controller to compensate the contour error between axes. In this paper, the stage adoptive is designed by our laboratory (Precision Machine Center of National Formosa University). The cross-coupling controller calculates the actuating signal of each axis by combining multi-axes position error. Hence, the cross-coupling controller improves the stage tracking ability and decreases the contour error. The experiments show excellent stage motion. This finding confirms that the proposed method is a powerful and efficient tool for improving stage tracking ability. Also found were the stages tracking to minimize contour error of two types circular to approximately 25nm. △ Less

Submitted 30 April, 2008; originally announced April 2008.

Comments: Uploaded by ICIUS2007 Conference Organizer on behalf of the author(s). 6 pages, 8 figures

ACM Class: B.1.2

Journal ref: Proceedings of the International Conference on Intelligent Unmanned System (ICIUS 2007), Bali, Indonesia, October 24-25, 2007, Paper No. ICIUS2007-C002

Showing 1–29 of 29 results for author: Hsieh, T