Search | arXiv e-print repository

Speaker-Independent Acoustic-to-Articulatory Inversion through Multi-Channel Attention Discriminator

Abstract: We present a novel speaker-independent acoustic-to-articulatory inversion (AAI) model, overcoming the limitations observed in conventional AAI models that rely on acoustic features derived from restricted datasets. To address these challenges, we leverage representations from a pre-trained self-supervised learning (SSL) model to more effectively estimate the global, local, and kinematic pattern in… ▽ More We present a novel speaker-independent acoustic-to-articulatory inversion (AAI) model, overcoming the limitations observed in conventional AAI models that rely on acoustic features derived from restricted datasets. To address these challenges, we leverage representations from a pre-trained self-supervised learning (SSL) model to more effectively estimate the global, local, and kinematic pattern information in Electromagnetic Articulography (EMA) signals during the AAI process. We train our model using an adversarial approach and introduce an attention-based Multi-duration phoneme discriminator (MDPD) designed to fully capture the intricate relationship among multi-channel articulatory signals. Our method achieves a Pearson correlation coefficient of 0.847, marking state-of-the-art performance in speaker-independent AAI models. The implementation details and code can be found online. △ Less

Submitted 25 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 2024

arXiv:2406.12688 [pdf, other]

Speak in the Scene: Diffusion-based Acoustic Scene Transfer toward Immersive Speech Generation

Authors: Miseul Kim, Soo-Whan Chung, Youna Ji, Hong-Goo Kang, Min-Seok Choi

Abstract: This paper introduces a novel task in generative speech processing, Acoustic Scene Transfer (AST), which aims to transfer acoustic scenes of speech signals to diverse environments. AST promises an immersive experience in speech perception by adapting the acoustic scene behind speech signals to desired environments. We propose AST-LDM for the AST task, which generates speech signals accompanied by… ▽ More This paper introduces a novel task in generative speech processing, Acoustic Scene Transfer (AST), which aims to transfer acoustic scenes of speech signals to diverse environments. AST promises an immersive experience in speech perception by adapting the acoustic scene behind speech signals to desired environments. We propose AST-LDM for the AST task, which generates speech signals accompanied by the target acoustic scene of the reference prompt. Specifically, AST-LDM is a latent diffusion model conditioned by CLAP embeddings that describe target acoustic scenes in either audio or text modalities. The contributions of this paper include introducing the AST task and implementing its baseline model. For AST-LDM, we emphasize its core framework, which is to preserve the input speech and generate audio consistently with both the given speech and the target acoustic environment. Experiments, including objective and subjective tests, validate the feasibility and efficacy of our approach. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

arXiv:2406.09819 [pdf, other]

Enhanced Deep Speech Separation in Clustered Ad Hoc Distributed Microphone Environments

Authors: Jihyun Kim, Stijn Kindt, Nilesh Madhu, Hong-Goo Kang

Abstract: Ad-hoc distributed microphone environments, where microphone locations and numbers are unpredictable, present a challenge to traditional deep learning models, which typically require fixed architectures. To tailor deep learning models to accommodate arbitrary array configurations, the Transform-Average-Concatenate (TAC) layer was previously introduced. In this work, we integrate TAC layers with du… ▽ More Ad-hoc distributed microphone environments, where microphone locations and numbers are unpredictable, present a challenge to traditional deep learning models, which typically require fixed architectures. To tailor deep learning models to accommodate arbitrary array configurations, the Transform-Average-Concatenate (TAC) layer was previously introduced. In this work, we integrate TAC layers with dual-path transformers for speech separation from two simultaneous talkers in realistic settings. However, the distributed nature makes it hard to fuse information across microphones efficiently. Therefore, we explore the efficacy of blindly clustering microphones around sources of interest prior to enhancement. Experimental results show that this deep cluster-informed approach significantly improves the system's capacity to cope with the inherent variability observed in ad-hoc distributed microphone environments. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

arXiv:2404.01330 [pdf, other]

Holo-VQVAE: VQ-VAE for phase-only holograms

Authors: Joohyun Park, Hyeongyeop Kang

Abstract: Holography stands at the forefront of visual technology innovation, offering immersive, three-dimensional visualizations through the manipulation of light wave amplitude and phase. Contemporary research in hologram generation has predominantly focused on image-to-hologram conversion, producing holograms from existing images. These approaches, while effective, inherently limit the scope of innovati… ▽ More Holography stands at the forefront of visual technology innovation, offering immersive, three-dimensional visualizations through the manipulation of light wave amplitude and phase. Contemporary research in hologram generation has predominantly focused on image-to-hologram conversion, producing holograms from existing images. These approaches, while effective, inherently limit the scope of innovation and creativity in hologram generation. In response to this limitation, we present Holo-VQVAE, a novel generative framework tailored for phase-only holograms (POHs). Holo-VQVAE leverages the architecture of Vector Quantized Variational AutoEncoders, enabling it to learn the complex distributions of POHs. Furthermore, it integrates the Angular Spectrum Method into the training process, facilitating learning in the image domain. This framework allows for the generation of unseen, diverse holographic content directly from its intricately learned latent space without requiring pre-existing images. This pioneering work paves the way for groundbreaking applications and methodologies in holographic content creation, opening a new era in the exploration of holographic content. △ Less

Submitted 29 March, 2024; originally announced April 2024.

arXiv:2402.09341 [pdf, other]

Registration of Longitudinal Spine CTs for Monitoring Lesion Growth

Authors: Malika Sanhinova, Nazim Haouchine, Steve D. Pieper, William M. Wells III, Tracy A. Balboni, Alexander Spektor, Mai Anh Huynh, Jeffrey P. Guenette, Bryan Czajkowski, Sarah Caplan, Patrick Doyle, Heejoo Kang, David B. Hackney, Ron N. Alkalay

Abstract: Accurate and reliable registration of longitudinal spine images is essential for assessment of disease progression and surgical outcome. Implementing a fully automatic and robust registration is crucial for clinical use, however, it is challenging due to substantial change in shape and appearance due to lesions. In this paper we present a novel method to automatically align longitudinal spine CTs… ▽ More Accurate and reliable registration of longitudinal spine images is essential for assessment of disease progression and surgical outcome. Implementing a fully automatic and robust registration is crucial for clinical use, however, it is challenging due to substantial change in shape and appearance due to lesions. In this paper we present a novel method to automatically align longitudinal spine CTs and accurately assess lesion progression. Our method follows a two-step pipeline where vertebrae are first automatically localized, labeled and 3D surfaces are generated using a deep learning model, then longitudinally aligned using a Gaussian mixture model surface registration. We tested our approach on 37 vertebrae, from 5 patients, with baseline CTs and 3, 6, and 12 months follow-ups leading to 111 registrations. Our experiment showed accurate registration with an average Hausdorff distance of 0.65 mm and average Dice score of 0.92. △ Less

Submitted 14 February, 2024; originally announced February 2024.

Comments: Paper accepted for publication at SPIE Medical Imaging 2024

arXiv:2312.13752 [pdf]

Hunting imaging biomarkers in pulmonary fibrosis: Benchmarks of the AIIB23 challenge

Authors: Yang Nan, Xiaodan Xing, Shiyi Wang, Zeyu Tang, Federico N Felder, Sheng Zhang, Roberta Eufrasia Ledda, Xiaoliu Ding, Ruiqi Yu, Wei** Liu, Feng Shi, Tianyang Sun, Zehong Cao, Minghui Zhang, Yun Gu, Hanxiao Zhang, Jian Gao, **yu Wang, Wen Tang, Pengxin Yu, Han Kang, Junqiang Chen, Xing Lu, Boyu Zhang, Michail Mamalakis , et al. (16 additional authors not shown)

Abstract: Airway-related quantitative imaging biomarkers are crucial for examination, diagnosis, and prognosis in pulmonary diseases. However, the manual delineation of airway trees remains prohibitively time-consuming. While significant efforts have been made towards enhancing airway modelling, current public-available datasets concentrate on lung diseases with moderate morphological variations. The intric… ▽ More Airway-related quantitative imaging biomarkers are crucial for examination, diagnosis, and prognosis in pulmonary diseases. However, the manual delineation of airway trees remains prohibitively time-consuming. While significant efforts have been made towards enhancing airway modelling, current public-available datasets concentrate on lung diseases with moderate morphological variations. The intricate honeycombing patterns present in the lung tissues of fibrotic lung disease patients exacerbate the challenges, often leading to various prediction errors. To address this issue, the 'Airway-Informed Quantitative CT Imaging Biomarker for Fibrotic Lung Disease 2023' (AIIB23) competition was organized in conjunction with the official 2023 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI). The airway structures were meticulously annotated by three experienced radiologists. Competitors were encouraged to develop automatic airway segmentation models with high robustness and generalization abilities, followed by exploring the most correlated QIB of mortality prediction. A training set of 120 high-resolution computerised tomography (HRCT) scans were publicly released with expert annotations and mortality status. The online validation set incorporated 52 HRCT scans from patients with fibrotic lung disease and the offline test set included 140 cases from fibrosis and COVID-19 patients. The results have shown that the capacity of extracting airway trees from patients with fibrotic lung disease could be enhanced by introducing voxel-wise weighted general union loss and continuity loss. In addition to the competitive image biomarkers for prognosis, a strong airway-derived biomarker (Hazard ratio>1.5, p<0.0001) was revealed for survival prognostication compared with existing clinical measurements, clinician assessment and AI-based biomarkers. △ Less

Submitted 16 April, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: 19 pages

arXiv:2312.13615 [pdf, other]

Self-supervised Complex Network for Machine Sound Anomaly Detection

Authors: Miseul Kim, Minh Tri Ho, Hong-Goo Kang

Abstract: In this paper, we propose an anomaly detection algorithm for machine sounds with a deep complex network trained by self-supervision. Using the fact that phase continuity information is crucial for detecting abnormalities in time-series signals, our proposed algorithm utilizes the complex spectrum as an input and performs complex number arithmetic throughout the entire process. Since the usefulness… ▽ More In this paper, we propose an anomaly detection algorithm for machine sounds with a deep complex network trained by self-supervision. Using the fact that phase continuity information is crucial for detecting abnormalities in time-series signals, our proposed algorithm utilizes the complex spectrum as an input and performs complex number arithmetic throughout the entire process. Since the usefulness of phase information can vary depending on the type of machine sound, we also apply an attention mechanism to control the weights of the complex and magnitude spectrum bottleneck features depending on the machine type. We train our network to perform a self-supervised task that classifies the machine identifier (id) of normal input sounds among multiple classes. At test time, an input signal is detected as anomalous if the trained model is unable to correctly classify the id. In other words, we determine the presence of an anomality when the output cross-entropy score of the multiclass identification task is lower than a pre-defined threshold. Experiments with the MIMII dataset show that the proposed algorithm has a much higher area under the curve (AUC) score than conventional magnitude spectrum-based algorithms. △ Less

Submitted 21 December, 2023; originally announced December 2023.

Comments: Published in EUSIPCO 2021

arXiv:2312.13603 [pdf, other]

Style Modeling for Multi-Speaker Articulation-to-Speech

Authors: Miseul Kim, Zhenyu Piao, Jihyun Lee, Hong-Goo Kang

Abstract: In this paper, we propose a neural articulation-to-speech (ATS) framework that synthesizes high-quality speech from articulatory signal in a multi-speaker situation. Most conventional ATS approaches only focus on modeling contextual information of speech from a single speaker's articulatory features. To explicitly represent each speaker's speaking style as well as the contextual information, our p… ▽ More In this paper, we propose a neural articulation-to-speech (ATS) framework that synthesizes high-quality speech from articulatory signal in a multi-speaker situation. Most conventional ATS approaches only focus on modeling contextual information of speech from a single speaker's articulatory features. To explicitly represent each speaker's speaking style as well as the contextual information, our proposed model estimates style embeddings, guided from the essential speech style attributes such as pitch and energy. We adopt convolutional layers and transformer-based attention layers for our model to fully utilize both local and global information of articulatory signals, measured by electromagnetic articulography (EMA). Our model significantly improves the quality of synthesized speech compared to the baseline in terms of objective and subjective measurements in the Haskins dataset. △ Less

Submitted 21 December, 2023; originally announced December 2023.

Comments: 5 pages, Accepted to ICASSP 2023

arXiv:2312.13600 [pdf, other]

BrainTalker: Low-Resource Brain-to-Speech Synthesis with Transfer Learning using Wav2Vec 2.0

Authors: Miseul Kim, Zhenyu Piao, Jihyun Lee, Hong-Goo Kang

Abstract: Decoding spoken speech from neural activity in the brain is a fast-emerging research topic, as it could enable communication for people who have difficulties with producing audible speech. For this task, electrocorticography (ECoG) is a common method for recording brain activity with high temporal resolution and high spatial precision. However, due to the risky surgical procedure required for obta… ▽ More Decoding spoken speech from neural activity in the brain is a fast-emerging research topic, as it could enable communication for people who have difficulties with producing audible speech. For this task, electrocorticography (ECoG) is a common method for recording brain activity with high temporal resolution and high spatial precision. However, due to the risky surgical procedure required for obtaining ECoG recordings, relatively little of this data has been collected, and the amount is insufficient to train a neural network-based Brain-to-Speech (BTS) system. To address this problem, we propose BrainTalker-a novel BTS framework that generates intelligible spoken speech from ECoG signals under extremely low-resource scenarios. We apply a transfer learning approach utilizing a pre-trained self supervised model, Wav2Vec 2.0. Specifically, we train an encoder module to map ECoG signals to latent embeddings that match Wav2Vec 2.0 representations of the corresponding spoken speech. These embeddings are then transformed into mel-spectrograms using stacked convolutional and transformer-based layers, which are fed into a neural vocoder to synthesize speech waveform. Experimental results demonstrate our proposed framework achieves outstanding performance in terms of subjective and objective metrics, including a Pearson correlation coefficient of 0.9 between generated and ground truth mel spectrograms. We share publicly available Demos and Code. △ Less

Submitted 21 December, 2023; originally announced December 2023.

Comments: 5 pages. Accepted to BHI 2023

arXiv:2311.00364 [pdf, other]

C2C: Cough to COVID-19 Detection in BHI 2023 Data Challenge

Authors: Woo-** Chung, Miseul Kim, Hong-Goo Kang

Abstract: This report describes our submission to BHI 2023 Data Competition: Sensor challenge. Our Audio Alchemists team designed an acoustic-based COVID-19 diagnosis system, Cough to COVID-19 (C2C), and won the 1st place in the challenge. C2C involves three key contributions: pre-processing of input signals, cough-related representation extraction leveraging Wav2vec2.0, and data augmentation. Through exper… ▽ More This report describes our submission to BHI 2023 Data Competition: Sensor challenge. Our Audio Alchemists team designed an acoustic-based COVID-19 diagnosis system, Cough to COVID-19 (C2C), and won the 1st place in the challenge. C2C involves three key contributions: pre-processing of input signals, cough-related representation extraction leveraging Wav2vec2.0, and data augmentation. Through experimental findings, we demonstrate C2C's promising potential to enhance the diagnostic accuracy of COVID-19 via cough signals. Our proposed model achieves a ROC-AUC value of 0.7810 in the context of COVID-19 diagnosis. The implementation details and the python code can be found in the following link: https://github.com/Woo-**-Chung/BHI_2023_challenge_Audio_Alchemists △ Less

Submitted 1 November, 2023; originally announced November 2023.

Comments: 1st place winning paper from the BHI 2023 Data Challenge Competition: Sensor Informatics

arXiv:2308.14909 [pdf, other]

doi 10.21437/Interspeech.2023-1301

Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech

Authors: Hyungchan Yoon, Changhwan Kim, Eunwoo Song, Hyun-Wook Yoon, Hong-Goo Kang

Abstract: For personalized speech generation, a neural text-to-speech (TTS) model must be successfully implemented with limited data from a target speaker. To this end, the baseline TTS model needs to be amply generalized to out-of-domain data (i.e., target speaker's speech). However, approaches to address this out-of-domain generalization problem in TTS have yet to be thoroughly studied. In this work, we p… ▽ More For personalized speech generation, a neural text-to-speech (TTS) model must be successfully implemented with limited data from a target speaker. To this end, the baseline TTS model needs to be amply generalized to out-of-domain data (i.e., target speaker's speech). However, approaches to address this out-of-domain generalization problem in TTS have yet to be thoroughly studied. In this work, we propose an effective pruning method for a transformer known as sparse attention, to improve the TTS model's generalization abilities. In particular, we prune off redundant connections from self-attention layers whose attention weights are below the threshold. To flexibly determine the pruning strength for searching optimal degree of generalization, we also propose a new differentiable pruning method that allows the model to automatically learn the thresholds. Evaluations on zero-shot multi-speaker TTS verify the effectiveness of our method in terms of voice quality and speaker similarity. △ Less

Submitted 28 August, 2023; originally announced August 2023.

Comments: INTERSPEECH 2023

Journal ref: Proc. INTERSPEECH 2023, 4299-4303

arXiv:2307.06961 [pdf, other]

Coordinated Path Following of UAVs using Event-Triggered Communication over Time-Varying Networks with Digraph Topologies

Authors: Hyungsoo Kang, Isaac Kaminer, Venanzio Cichella, Naira Hovakimyan

Abstract: In this article, a novel time-coordination algorithm based on event-triggered communications is proposed to achieve coordinated path-following of UAVs. To be specific, in the approach adopted a UAV transmits its progression information over a time-varying network to its neighbors only when a decentralized trigger condition is satisfied, thereby significantly reducing the volume of inter-vehicle co… ▽ More In this article, a novel time-coordination algorithm based on event-triggered communications is proposed to achieve coordinated path-following of UAVs. To be specific, in the approach adopted a UAV transmits its progression information over a time-varying network to its neighbors only when a decentralized trigger condition is satisfied, thereby significantly reducing the volume of inter-vehicle communications required when compared with the existing algorithms based on continuous communications. Using such intermittent communications, it is shown that a decentralized coordination controller guarantees exponential convergence of the coordination error to a neighborhood of zero. Also, a lower bound on the interval between two consecutive event-triggered times is provided showing that the chattering issue does not arise with the proposed algorithm. Finally, simulation results validate the efficacy of the proposed algorithm. △ Less

Submitted 13 July, 2023; originally announced July 2023.

Comments: arXiv admin note: text overlap with arXiv:2307.06553

arXiv:2307.06553 [pdf, other]

Coordinated Path Following of UAVs over Time-Varying Digraphs Connected in an Integral Sense

Authors: Hyungsoo Kang, Isaac Kaminer, Venanzio Cichella, Naira Hovakimyan

Abstract: This paper presents a new connectivity condition on the information flow between UAVs to achieve coordinated path following. The information flow is directional, so that the underlying communication network topology is represented by a time-varying digraph. We assume that this digraph is connected in an integral sense. This is a much more general assumption than the one currently used in the liter… ▽ More This paper presents a new connectivity condition on the information flow between UAVs to achieve coordinated path following. The information flow is directional, so that the underlying communication network topology is represented by a time-varying digraph. We assume that this digraph is connected in an integral sense. This is a much more general assumption than the one currently used in the literature. Under this assumption, it is shown that a decentralized coordination controller ensures exponential convergence of the coordination error vector to a neighborhood of zero. The efficacy of the algorithm is confirmed with simulation results. △ Less

Submitted 15 March, 2024; v1 submitted 13 July, 2023; originally announced July 2023.

arXiv:2306.09640 [pdf, other]

MF-PAM: Accurate Pitch Estimation through Periodicity Analysis and Multi-level Feature Fusion

Authors: Woo-** Chung, Doyeon Kim, Soo-Whan Chung, Hong-Goo Kang

Abstract: We introduce Multi-level feature Fusion-based Periodicity Analysis Model (MF-PAM), a novel deep learning-based pitch estimation model that accurately estimates pitch trajectory in noisy and reverberant acoustic environments. Our model leverages the periodic characteristics of audio signals and involves two key steps: extracting pitch periodicity using periodic non-periodic convolution (PNP-Conv) b… ▽ More We introduce Multi-level feature Fusion-based Periodicity Analysis Model (MF-PAM), a novel deep learning-based pitch estimation model that accurately estimates pitch trajectory in noisy and reverberant acoustic environments. Our model leverages the periodic characteristics of audio signals and involves two key steps: extracting pitch periodicity using periodic non-periodic convolution (PNP-Conv) blocks and estimating pitch by aggregating multi-level features using a modified bi-directional feature pyramid network (BiFPN). We evaluate our model on speech and music datasets and achieve superior pitch estimation performance compared to state-of-the-art baselines while using fewer model parameters. Our model achieves 99.20 % accuracy in pitch estimation on a clean musical dataset. Overall, our proposed model provides a promising solution for accurate pitch estimation in challenging acoustic environments and has potential applications in audio signal processing. △ Less

Submitted 16 June, 2023; originally announced June 2023.

Comments: accepted at INTERSPEECH 2023

arXiv:2306.08406 [pdf, other]

Feature Normalization for Fine-tuning Self-Supervised Models in Speech Enhancement

Authors: Hejung Yang, Hong-Goo Kang

Abstract: Large, pre-trained representation models trained using self-supervised learning have gained popularity in various fields of machine learning because they are able to extract high-quality salient features from input data. As such, they have been frequently used as base networks for various pattern classification tasks such as speech recognition. However, not much research has been conducted on appl… ▽ More Large, pre-trained representation models trained using self-supervised learning have gained popularity in various fields of machine learning because they are able to extract high-quality salient features from input data. As such, they have been frequently used as base networks for various pattern classification tasks such as speech recognition. However, not much research has been conducted on applying these types of models to the field of speech signal generation. In this paper, we investigate the feasibility of using pre-trained speech representation models for a downstream speech enhancement task. To alleviate mismatches between the input features of the pre-trained model and the target enhancement model, we adopt a novel feature normalization technique to smoothly link these modules together. Our proposed method enables significant improvements in speech quality compared to baselines when combined with various types of pre-trained speech models. △ Less

Submitted 14 June, 2023; originally announced June 2023.

Comments: INTERSPEECH 2023 accepted

arXiv:2306.01411 [pdf, other]

HD-DEMUCS: General Speech Restoration with Heterogeneous Decoders

Authors: Doyeon Kim, Soo-Whan Chung, Hyewon Han, Youna Ji, Hong-Goo Kang

Abstract: This paper introduces an end-to-end neural speech restoration model, HD-DEMUCS, demonstrating efficacy across multiple distortion environments. Unlike conventional approaches that employ cascading frameworks to remove undesirable noise first and then restore missing signal components, our model performs these tasks in parallel using two heterogeneous decoder networks. Based on the U-Net style enco… ▽ More This paper introduces an end-to-end neural speech restoration model, HD-DEMUCS, demonstrating efficacy across multiple distortion environments. Unlike conventional approaches that employ cascading frameworks to remove undesirable noise first and then restore missing signal components, our model performs these tasks in parallel using two heterogeneous decoder networks. Based on the U-Net style encoder-decoder framework, we attach an additional decoder so that each decoder network performs noise suppression or restoration separately. We carefully design each decoder architecture to operate appropriately depending on its objectives. Additionally, we improve performance by leveraging a learnable weighting factor, aggregating the two decoder output waveforms. Experimental results with objective metrics across various environments clearly demonstrate the effectiveness of our approach over a single decoder or multi-stage systems for general speech restoration task. △ Less

Submitted 2 June, 2023; originally announced June 2023.

Comments: Accepted by INTERSPEECH 2023

arXiv:2305.06806 [pdf, other]

HappyQuokka System for ICASSP 2023 Auditory EEG Challenge

Authors: Zhenyu Piao, Miseul Kim, Hyungchan Yoon, Hong-Goo Kang

Abstract: This report describes our submission to Task 2 of the Auditory EEG Decoding Challenge at ICASSP 2023 Signal Processing Grand Challenge (SPGC). Task 2 is a regression problem that focuses on reconstructing a speech envelope from an EEG signal. For the task, we propose a pre-layer normalized feed-forward transformer (FFT) architecture. For within-subjects generation, we additionally utilize an auxil… ▽ More This report describes our submission to Task 2 of the Auditory EEG Decoding Challenge at ICASSP 2023 Signal Processing Grand Challenge (SPGC). Task 2 is a regression problem that focuses on reconstructing a speech envelope from an EEG signal. For the task, we propose a pre-layer normalized feed-forward transformer (FFT) architecture. For within-subjects generation, we additionally utilize an auxiliary global conditioner which provides our model with additional information about seen individuals. Experimental results show that our proposed method outperforms the VLAAI baseline and all other submitted systems. Notably, it demonstrates significant improvements on the within-subjects task, likely thanks to our use of the auxiliary global conditioner. In terms of evaluation metrics set by the challenge, we obtain Pearson correlation values of 0.1895 0.0869 for the within-subjects generation test and 0.0976 0.0444 for the heldout-subjects test. We release the training code for our model online. △ Less

Submitted 3 May, 2023; originally announced May 2023.

Comments: First Place in Task 2 of Auditory EEG decoding Challenge, which is part of ICASSP Signal Processing Grand Challenge (SPGC) 2023

arXiv:2305.06511 [pdf, other]

ParamNet: A Parameter-variable Network for Fast Stain Normalization

Authors: Hongtao Kang, Die Luo, Li Chen, Junbo Hu, Shenghua Cheng, Tingwei Quan, Shaoqun Zeng, Xiuli Liu

Abstract: In practice, digital pathology images are often affected by various factors, resulting in very large differences in color and brightness. Stain normalization can effectively reduce the differences in color and brightness of digital pathology images, thus improving the performance of computer-aided diagnostic systems. Conventional stain normalization methods rely on one or several reference images,… ▽ More In practice, digital pathology images are often affected by various factors, resulting in very large differences in color and brightness. Stain normalization can effectively reduce the differences in color and brightness of digital pathology images, thus improving the performance of computer-aided diagnostic systems. Conventional stain normalization methods rely on one or several reference images, but one or several images are difficult to represent the entire dataset. Although learning-based stain normalization methods are a general approach, they use complex deep networks, which not only greatly reduce computational efficiency, but also risk introducing artifacts. StainNet is a fast and robust stain normalization network, but it has not a sufficient capability for complex stain normalization due to its too simple network structure. In this study, we proposed a parameter-variable stain normalization network, ParamNet. ParamNet contains a parameter prediction sub-network and a color map** sub-network, where the parameter prediction sub-network can automatically determine the appropriate parameters for the color map** sub-network according to each input image. The feature of parameter variable ensures that our network has a sufficient capability for various stain normalization tasks. The color map** sub-network is a fully 1x1 convolutional network with a total of 59 variable parameters, which allows our network to be extremely computationally efficient and does not introduce artifacts. The results on cytopathology and histopathology datasets show that our ParamNet outperforms state-of-the-art methods and can effectively improve the generalization of classifiers on pathology diagnosis tasks. The code has been available at https://github.com/khtao/ParamNet. △ Less

Submitted 10 May, 2023; originally announced May 2023.

arXiv:2211.09385 [pdf, other]

ComMU: Dataset for Combinatorial Music Generation

Authors: Lee Hyun, Taehyun Kim, Hyolim Kang, Minjoo Ki, Hyeonchan Hwang, Kwanho Park, Sharang Han, Seon Joo Kim

Abstract: Commercial adoption of automatic music composition requires the capability of generating diverse and high-quality music suitable for the desired context (e.g., music for romantic movies, action games, restaurants, etc.). In this paper, we introduce combinatorial music generation, a new task to create varying background music based on given conditions. Combinatorial music generation creates short s… ▽ More Commercial adoption of automatic music composition requires the capability of generating diverse and high-quality music suitable for the desired context (e.g., music for romantic movies, action games, restaurants, etc.). In this paper, we introduce combinatorial music generation, a new task to create varying background music based on given conditions. Combinatorial music generation creates short samples of music with rich musical metadata, and combines them to produce a complete music. In addition, we introduce ComMU, the first symbolic music dataset consisting of short music samples and their corresponding 12 musical metadata for combinatorial music generation. Notable properties of ComMU are that (1) dataset is manually constructed by professional composers with an objective guideline that induces regularity, and (2) it has 12 musical metadata that embraces composers' intentions. Our results show that we can generate diverse high-quality music only with metadata, and that our unique metadata such as track-role and extended chord quality improves the capacity of the automatic composition. We highly recommend watching our video before reading the paper (https://pozalabs.github.io/ComMU). △ Less

Submitted 17 November, 2022; originally announced November 2022.

Comments: 19 pages, 12 figures

arXiv:2210.03786 [pdf, ps, other]

doi 10.1007/978-3-031-16980-9_14

Evaluating the Performance of StyleGAN2-ADA on Medical Images

Authors: McKell Woodland, John Wood, Brian M. Anderson, Suprateek Kundu, Ethan Lin, Eugene Koay, Bruno Odisio, Caroline Chung, Hyunseon Christine Kang, Aradhana M. Venkatesan, Sireesha Yedururi, Brian De, Yuan-Mao Lin, Ankit B. Patel, Kristy K. Brock

Abstract: Although generative adversarial networks (GANs) have shown promise in medical imaging, they have four main limitations that impeded their utility: computational cost, data requirements, reliable evaluation measures, and training complexity. Our work investigates each of these obstacles in a novel application of StyleGAN2-ADA to high-resolution medical imaging datasets. Our dataset is comprised of… ▽ More Although generative adversarial networks (GANs) have shown promise in medical imaging, they have four main limitations that impeded their utility: computational cost, data requirements, reliable evaluation measures, and training complexity. Our work investigates each of these obstacles in a novel application of StyleGAN2-ADA to high-resolution medical imaging datasets. Our dataset is comprised of liver-containing axial slices from non-contrast and contrast-enhanced computed tomography (CT) scans. Additionally, we utilized four public datasets composed of various imaging modalities. We trained a StyleGAN2 network with transfer learning (from the Flickr-Faces-HQ dataset) and data augmentation (horizontal flip** and adaptive discriminator augmentation). The network's generative quality was measured quantitatively with the Fréchet Inception Distance (FID) and qualitatively with a visual Turing test given to seven radiologists and radiation oncologists. The StyleGAN2-ADA network achieved a FID of 5.22 ($\pm$ 0.17) on our liver CT dataset. It also set new record FIDs of 10.78, 3.52, 21.17, and 5.39 on the publicly available SLIVER07, ChestX-ray14, ACDC, and Medical Segmentation Decathlon (brain tumors) datasets. In the visual Turing test, the clinicians rated generated images as real 42% of the time, approaching random guessing. Our computational ablation study revealed that transfer learning and data augmentation stabilize training and improve the perceptual quality of the generated images. We observed the FID to be consistent with human perceptual evaluation of medical images. Finally, our work found that StyleGAN2-ADA consistently produces high-quality results without hyperparameter searches or retraining. △ Less

Submitted 7 October, 2022; originally announced October 2022.

Comments: This preprint has not undergone post-submission improvements or corrections. The Version of Record of this contribution is published in LNCS, volume 13570, and is available online at https://doi.org/10.1007/978-3-031-16980-9_14

Journal ref: Lecture Notes in Computer Science 13570 (2022)

arXiv:2206.15400 [pdf, other]

Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting

Authors: Hyeon-Kyeong Shin, Hyewon Han, Doyeon Kim, Soo-Whan Chung, Hong-Goo Kang

Abstract: In this paper, we propose a novel end-to-end user-defined keyword spotting method that utilizes linguistically corresponding patterns between speech and text sequences. Unlike previous approaches requiring speech keyword enrollment, our method compares input queries with an enrolled text keyword sequence. To place the audio and text representations within a common latent space, we adopt an attenti… ▽ More In this paper, we propose a novel end-to-end user-defined keyword spotting method that utilizes linguistically corresponding patterns between speech and text sequences. Unlike previous approaches requiring speech keyword enrollment, our method compares input queries with an enrolled text keyword sequence. To place the audio and text representations within a common latent space, we adopt an attention-based cross-modal matching approach that is trained in an end-to-end manner with monotonic matching loss and keyword classification loss. We also utilize a de-noising loss for the acoustic embedding network to improve robustness in noisy environments. Additionally, we introduce the LibriPhrase dataset, a new short-phrase dataset based on LibriSpeech for efficiently training keyword spotting models. Our proposed method achieves competitive results on various evaluation sets compared to other single-modal and cross-modal baselines. △ Less

Submitted 1 July, 2022; v1 submitted 30 June, 2022; originally announced June 2022.

Comments: Accepted to Interspeech 2022

arXiv:2206.06267 [pdf, other]

doi 10.1109/EMBC48229.2022.9871639

MMMNA-Net for Overall Survival Time Prediction of Brain Tumor Patients

Authors: Wen Tang, Haoyue Zhang, Pengxin Yu, Han Kang, Rongguo Zhang

Abstract: Overall survival (OS) time is one of the most important evaluation indices for gliomas situations. Multimodal Magnetic Resonance Imaging (MRI) scans play an important role in the study of glioma prognosis OS time. Several deep learning-based methods are proposed for the OS time prediction on multi-modal MRI problems. However, these methods usually fuse multi-modal information at the beginning or a… ▽ More Overall survival (OS) time is one of the most important evaluation indices for gliomas situations. Multimodal Magnetic Resonance Imaging (MRI) scans play an important role in the study of glioma prognosis OS time. Several deep learning-based methods are proposed for the OS time prediction on multi-modal MRI problems. However, these methods usually fuse multi-modal information at the beginning or at the end of the deep learning networks and lack the fusion of features from different scales. In addition, the fusion at the end of networks always adapts global with global (eg. fully connected after concatenation of global average pooling output) or local with local (eg. bilinear pooling), which loses the information of local with global. In this paper, we propose a novel method for multi-modal OS time prediction of brain tumor patients, which contains an improved nonlocal features fusion module introduced on different scales. Our method obtains a relative 8.76% improvement over the current state-of-art method (0.6989 vs. 0.6426 on accuracy). Extensive testing demonstrates that our method could adapt to situations with missing modalities. The code is available at https://github.com/TangWen920812/mmmna-net. △ Less

Submitted 13 June, 2022; originally announced June 2022.

Comments: Accepted EMBC 2022

arXiv:2206.06253 [pdf, ps, other]

doi 10.1007/978-3-031-16446-0_33

RPLHR-CT Dataset and Transformer Baseline for Volumetric Super-Resolution from CT Scans

Authors: Pengxin Yu, Haoyue Zhang, Han Kang, Wen Tang, Corey W. Arnold, Rongguo Zhang

Abstract: In clinical practice, anisotropic volumetric medical images with low through-plane resolution are commonly used due to short acquisition time and lower storage cost. Nevertheless, the coarse resolution may lead to difficulties in medical diagnosis by either physicians or computer-aided diagnosis algorithms. Deep learning-based volumetric super-resolution (SR) methods are feasible ways to improve r… ▽ More In clinical practice, anisotropic volumetric medical images with low through-plane resolution are commonly used due to short acquisition time and lower storage cost. Nevertheless, the coarse resolution may lead to difficulties in medical diagnosis by either physicians or computer-aided diagnosis algorithms. Deep learning-based volumetric super-resolution (SR) methods are feasible ways to improve resolution, with convolutional neural networks (CNN) at their core. Despite recent progress, these methods are limited by inherent properties of convolution operators, which ignore content relevance and cannot effectively model long-range dependencies. In addition, most of the existing methods use pseudo-paired volumes for training and evaluation, where pseudo low-resolution (LR) volumes are generated by a simple degradation of their high-resolution (HR) counterparts. However, the domain gap between pseudo- and real-LR volumes leads to the poor performance of these methods in practice. In this paper, we build the first public real-paired dataset RPLHR-CT as a benchmark for volumetric SR, and provide baseline results by re-implementing four state-of-the-art CNN-based methods. Considering the inherent shortcoming of CNN, we also propose a transformer volumetric super-resolution network (TVSRN) based on attention mechanisms, dispensing with convolutions entirely. This is the first research to use a pure transformer for CT volumetric SR. The experimental results show that TVSRN significantly outperforms all baselines on both PSNR and SSIM. Moreover, the TVSRN method achieves a better trade-off between the image quality, the number of parameters, and the running time. Data and code are available at https://github.com/smilenaxx/RPLHR-CT. △ Less

Submitted 13 June, 2022; originally announced June 2022.

Comments: Accepted MICCAI 2022

arXiv:2205.15720 [pdf, other]

Progressive Multi-scale Consistent Network for Multi-class Fundus Lesion Segmentation

Authors: Along He, Kai Wang, Tao Li, Wang Bo, Hong Kang, Huazhu Fu

Abstract: Effectively integrating multi-scale information is of considerable significance for the challenging multi-class segmentation of fundus lesions because different lesions vary significantly in scales and shapes. Several methods have been proposed to successfully handle the multi-scale object segmentation. However, two issues are not considered in previous studies. The first is the lack of interactio… ▽ More Effectively integrating multi-scale information is of considerable significance for the challenging multi-class segmentation of fundus lesions because different lesions vary significantly in scales and shapes. Several methods have been proposed to successfully handle the multi-scale object segmentation. However, two issues are not considered in previous studies. The first is the lack of interaction between adjacent feature levels, and this will lead to the deviation of high-level features from low-level features and the loss of detailed cues. The second is the conflict between the low-level and high-level features, this occurs because they learn different scales of features, thereby confusing the model and decreasing the accuracy of the final prediction. In this paper, we propose a progressive multi-scale consistent network (PMCNet) that integrates the proposed progressive feature fusion (PFF) block and dynamic attention block (DAB) to address the aforementioned issues. Specifically, PFF block progressively integrates multi-scale features from adjacent encoding layers, facilitating feature learning of each layer by aggregating fine-grained details and high-level semantics. As features at different scales should be consistent, DAB is designed to dynamically learn the attentive cues from the fused features at different scales, thus aiming to smooth the essential conflicts existing in multi-scale features. The two proposed PFF and DAB blocks can be integrated with the off-the-shelf backbone networks to address the two issues of multi-scale and feature inconsistency in the multi-class segmentation of fundus lesions, which will produce better feature representation in the feature space. Experimental results on three public datasets indicate that the proposed method is more effective than recent state-of-the-art methods. △ Less

Submitted 31 May, 2022; originally announced May 2022.

arXiv:2205.04104 [pdf, other]

ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence

Authors: Sangshin Oh, Seyun Um, Hong-Goo Kang

Abstract: The Gumbel-softmax distribution, or Concrete distribution, is often used to relax the discrete characteristics of a categorical distribution and enable back-propagation through differentiable reparameterization. Although it reliably yields low variance gradients, it still relies on a stochastic sampling process for optimization. In this work, we present a relaxed categorical analytic bound (ReCAB)… ▽ More The Gumbel-softmax distribution, or Concrete distribution, is often used to relax the discrete characteristics of a categorical distribution and enable back-propagation through differentiable reparameterization. Although it reliably yields low variance gradients, it still relies on a stochastic sampling process for optimization. In this work, we present a relaxed categorical analytic bound (ReCAB), a novel divergence-like metric which corresponds to the upper bound of the Kullback-Leibler divergence (KLD) of a relaxed categorical distribution. The proposed metric is easy to implement because it has a closed form solution, and empirical results show that it is close to the actual KLD. Along with this new metric, we propose a relaxed categorical analytic bound variational autoencoder (ReCAB-VAE) that successfully models both continuous and relaxed discrete latent representations. We implement an emotional text-to-speech synthesis system based on the proposed framework, and show that the proposed system flexibly and stably controls emotion expressions with better speech quality compared to baselines that use stochastic estimation or categorical distribution approximation. △ Less

Submitted 9 May, 2022; originally announced May 2022.

arXiv:2205.01528 [pdf, other]

Attentive activation function for improving end-to-end spoofing countermeasure systems

Authors: Woo Hyun Kang, Jahangir Alam, Abderrahim Fathan

Abstract: The main objective of the spoofing countermeasure system is to detect the artifacts within the input speech caused by the speech synthesis or voice conversion process. In order to achieve this, we propose to adopt an attentive activation function, more specifically attention rectified linear unit (AReLU) to the end-to-end spoofing countermeasure system. Since the AReLU employs the attention mechan… ▽ More The main objective of the spoofing countermeasure system is to detect the artifacts within the input speech caused by the speech synthesis or voice conversion process. In order to achieve this, we propose to adopt an attentive activation function, more specifically attention rectified linear unit (AReLU) to the end-to-end spoofing countermeasure system. Since the AReLU employs the attention mechanism to boost the contribution of relevant input features while suppressing the irrelevant ones, introducing AReLU can help the countermeasure system to focus on the features related to the artifacts. The proposed framework was experimented on the logical access (LA) task of ASVSpoof2019 dataset, and outperformed the systems using the standard non-learnable activation functions. △ Less

Submitted 3 May, 2022; originally announced May 2022.

arXiv:2204.03315 [pdf, ps, other]

Three-Module Modeling For End-to-End Spoken Language Understanding Using Pre-trained DNN-HMM-Based Acoustic-Phonetic Model

Authors: Nick J. C. Wang, Lu Wang, Yandan Sun, Haimei Kang, Dejun Zhang

Abstract: In spoken language understanding (SLU), what the user says is converted to his/her intent. Recent work on end-to-end SLU has shown that accuracy can be improved via pre-training approaches. We revisit ideas presented by Lugosch et al. using speech pre-training and three-module modeling; however, to ease construction of the end-to-end SLU model, we use as our phoneme module an open-source acoustic-… ▽ More In spoken language understanding (SLU), what the user says is converted to his/her intent. Recent work on end-to-end SLU has shown that accuracy can be improved via pre-training approaches. We revisit ideas presented by Lugosch et al. using speech pre-training and three-module modeling; however, to ease construction of the end-to-end SLU model, we use as our phoneme module an open-source acoustic-phonetic model from a DNN-HMM hybrid automatic speech recognition (ASR) system instead of training one from scratch. Hence we fine-tune on speech only for the word module, and we apply multi-target learning (MTL) on the word and intent modules to jointly optimize SLU performance. MTL yields a relative reduction of 40% in intent-classification error rates (from 1.0% to 0.6%). Note that our three-module model is a streaming method. The final outcome of the proposed three-module modeling approach yields an intent accuracy of 99.4% on FluentSpeech, an intent error rate reduction of 50% compared to that of Lugosch et al. Although we focus on real-time streaming methods, we also list non-streaming methods for comparison. △ Less

Submitted 7 April, 2022; originally announced April 2022.

Comments: Published in INTERSPEECH 2021

arXiv:2204.02172 [pdf, other]

doi 10.21437/Interspeech.2023-1571

Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech

Authors: Hyungchan Yoon, Seyun Um, Changwhan Kim, Hong-Goo Kang

Abstract: To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrogram). However, their generation quality is unsatisfactory as these representations lack speech variances. In this paper, we improve TTS performance by adding \emph{prosody embeddings} to the latent representations… ▽ More To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrogram). However, their generation quality is unsatisfactory as these representations lack speech variances. In this paper, we improve TTS performance by adding \emph{prosody embeddings} to the latent representations. During training, we extract reference prosody embeddings from mel-spectrograms, and during inference, we estimate these embeddings from text using generative adversarial networks (GANs). Using GANs, we reliably estimate the prosody embeddings in a fast way, which have complex distributions due to the dynamic nature of speech. We also show that the prosody embeddings work as efficient features for learning a robust alignment between text and acoustic features. Our proposed model surpasses several publicly available models with less parameters and computational complexity in comparative experiments. △ Less

Submitted 28 August, 2023; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: INTERSPEECH 2023

MSC Class: 68T07 (Primary) 68T50; 68T99 (Secondary) ACM Class: I.2.7; I.2.6

arXiv:2203.05125 [pdf, ps, other]

A Lifted $\ell_1 $ Framework for Sparse Recovery

Authors: Yaghoub Rahimi, Sung Ha Kang, Yifei Lou

Abstract: Motivated by re-weighted $\ell_1$ approaches for sparse recovery, we propose a lifted $\ell_1$ (LL1) regularization which is a generalized form of several popular regularizations in the literature. By exploring such connections, we discover there are two types of lifting functions which can guarantee that the proposed approach is equivalent to the $\ell_0$ minimization. Computationally, we design… ▽ More Motivated by re-weighted $\ell_1$ approaches for sparse recovery, we propose a lifted $\ell_1$ (LL1) regularization which is a generalized form of several popular regularizations in the literature. By exploring such connections, we discover there are two types of lifting functions which can guarantee that the proposed approach is equivalent to the $\ell_0$ minimization. Computationally, we design an efficient algorithm via the alternating direction method of multiplier (ADMM) and establish the convergence for an unconstrained formulation. Experimental results are presented to demonstrate how this generalization improves sparse recovery over the state-of-the-art. △ Less

Submitted 12 May, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

Comments: 24 pages

MSC Class: 65K10; 49N45; 65F50; 90C90; 49M20

arXiv:2203.02181 [pdf, other]

MANNER: Multi-view Attention Network for Noise Erasure

Authors: Hyun Joon Park, Byung Ha Kang, Wooseok Shin, ** Sob Kim, Sung Won Han

Abstract: In the field of speech enhancement, time domain methods have difficulties in achieving both high performance and efficiency. Recently, dual-path models have been adopted to represent long sequential features, but they still have limited representations and poor memory efficiency. In this study, we propose Multi-view Attention Network for Noise ERasure (MANNER) consisting of a convolutional encoder… ▽ More In the field of speech enhancement, time domain methods have difficulties in achieving both high performance and efficiency. Recently, dual-path models have been adopted to represent long sequential features, but they still have limited representations and poor memory efficiency. In this study, we propose Multi-view Attention Network for Noise ERasure (MANNER) consisting of a convolutional encoder-decoder with a multi-view attention block, applied to the time-domain signals. MANNER efficiently extracts three different representations from noisy speech and estimates high-quality clean speech. We evaluated MANNER on the VoiceBank-DEMAND dataset in terms of five objective speech quality metrics. Experimental results show that MANNER achieves state-of-the-art performance while efficiently processing noisy speech. △ Less

Submitted 4 March, 2022; originally announced March 2022.

Comments: To appear in ICASSP 2022

arXiv:2202.11918 [pdf, other]

Phase Continuity: Learning Derivatives of Phase Spectrum for Speech Enhancement

Authors: Doyeon Kim, Hyewon Han, Hyeon-Kyeong Shin, Soo-Whan Chung, Hong-Goo Kang

Abstract: Modern neural speech enhancement models usually include various forms of phase information in their training loss terms, either explicitly or implicitly. However, these loss terms are typically designed to reduce the distortion of phase spectrum values at specific frequencies, which ensures they do not significantly affect the quality of the enhanced speech. In this paper, we propose an effective… ▽ More Modern neural speech enhancement models usually include various forms of phase information in their training loss terms, either explicitly or implicitly. However, these loss terms are typically designed to reduce the distortion of phase spectrum values at specific frequencies, which ensures they do not significantly affect the quality of the enhanced speech. In this paper, we propose an effective phase reconstruction strategy for neural speech enhancement that can operate in noisy environments. Specifically, we introduce a phase continuity loss that considers relative phase variations across the time and frequency axes. By including this phase continuity loss in a state-of-the-art neural speech enhancement system trained with reconstruction loss and a number of magnitude spectral losses, we show that our proposed method further improves the quality of enhanced speech signals over the baseline, especially when training is done jointly with a magnitude spectrum loss. △ Less

Submitted 24 February, 2022; originally announced February 2022.

Comments: Accepted by ICASSP 2022

arXiv:2201.10283 [pdf, ps, other]

SASV Challenge 2022: A Spoofing Aware Speaker Verification Challenge Evaluation Plan

Authors: Jee-weon Jung, Hemlata Tak, Hye-** Shim, Hee-Soo Heo, Bong-** Lee, Soo-Whan Chung, Hong-Goo Kang, Ha-** Yu, Nicholas Evans, Tomi Kinnunen

Abstract: ASV (automatic speaker verification) systems are intrinsically required to reject both non-target (e.g., voice uttered by different speaker) and spoofed (e.g., synthesised or converted) inputs. However, there is little consideration for how ASV systems themselves should be adapted when they are expected to encounter spoofing attacks, nor when they operate in tandem with CMs (spoofing countermeasur… ▽ More ASV (automatic speaker verification) systems are intrinsically required to reject both non-target (e.g., voice uttered by different speaker) and spoofed (e.g., synthesised or converted) inputs. However, there is little consideration for how ASV systems themselves should be adapted when they are expected to encounter spoofing attacks, nor when they operate in tandem with CMs (spoofing countermeasures), much less how both systems should be jointly optimised. The goal of the first SASV (spoofing-aware speaker verification) challenge, a special sesscion in ISCA INTERSPEECH 2022, is to promote development of integrated systems that can perform ASV and CM simultaneously. △ Less

Submitted 2 March, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

Comments: Evaluation plan of the SASV Challenge 2022. See this webpage for more information: https://sasv-challenge.github.io

arXiv:2112.08222 [pdf, other]

Guaranteed Nonlinear Tracking in the Presence of DNN-Learned Dynamics With Contraction Metrics and Disturbance Estimation

Authors: Pan Zhao, Ziyao Guo, Aditya Gahlawat, Hyungsoo Kang, Naira Hovakimyan

Abstract: This paper presents an approach to trajectory-centric learning control based on contraction metrics and disturbance estimation for nonlinear systems subject to matched uncertainties. The approach uses deep neural networks to learn uncertain dynamics while still providing guarantees of transient tracking performance throughout the learning phase. Within the proposed approach, a disturbance estimati… ▽ More This paper presents an approach to trajectory-centric learning control based on contraction metrics and disturbance estimation for nonlinear systems subject to matched uncertainties. The approach uses deep neural networks to learn uncertain dynamics while still providing guarantees of transient tracking performance throughout the learning phase. Within the proposed approach, a disturbance estimation law is adopted to estimate the pointwise value of the uncertainty, with pre-computable estimation error bounds (EEBs). The learned dynamics, the estimated disturbances, and the EEBs are then incorporated in a robust Riemann energy condition to compute the control law that guarantees exponential convergence of actual trajectories to desired ones throughout the learning phase, even when the learned model is poor. On the other hand, with improved accuracy, the learned model can help improve the robustness of the tracking controller, e.g., against input delays, and can be incorporated to plan better trajectories with improved performance, e.g., lower energy consumption and shorter travel time.The proposed framework is validated on a planar quadrotor example. △ Less

Submitted 12 October, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

Comments: Shorter version submitted to ACC 2023

arXiv:2112.03454 [pdf, other]

Robust Speech Representation Learning via Flow-based Embedding Regularization

Authors: Woo Hyun Kang, Jahangir Alam, Abderrahim Fathan

Abstract: Over the recent years, various deep learning-based methods were proposed for extracting a fixed-dimensional embedding vector from speech signals. Although the deep learning-based embedding extraction methods have shown good performance in numerous tasks including speaker verification, language identification and anti-spoofing, their performance is limited when it comes to mismatched conditions due… ▽ More Over the recent years, various deep learning-based methods were proposed for extracting a fixed-dimensional embedding vector from speech signals. Although the deep learning-based embedding extraction methods have shown good performance in numerous tasks including speaker verification, language identification and anti-spoofing, their performance is limited when it comes to mismatched conditions due to the variability within them unrelated to the main task. In order to alleviate this problem, we propose a novel training strategy that regularizes the embedding network to have minimum information about the nuisance attributes. To achieve this, our proposed method directly incorporates the information bottleneck scheme into the training process, where the mutual information is estimated using the main task classifier and an auxiliary normalizing flow network. The proposed method was evaluated on different speech processing tasks and showed improvement over the standard training strategy in all experimentation. △ Less

Submitted 6 December, 2021; originally announced December 2021.

arXiv:2109.14779 [pdf, other]

Time Coordination of Multiple UAVs over Switching Communication Networks with Digraph Topologies

Authors: Hyungsoo Kang, Hyung-** Yoon, Venanzio Cichella, Naira Hovakimyan, Petros Voulgaris

Abstract: This paper presents a time-coordination algorithm for multiple UAVs executing cooperative missions. Unlike previous algorithms, it does not rely on the assumption that the communication between UAVs is bidirectional. Thus, the topology of the inter-UAV information flow can be characterized by digraphs. To achieve coordination with weak connectivity, we design a switching law that orchestrates swit… ▽ More This paper presents a time-coordination algorithm for multiple UAVs executing cooperative missions. Unlike previous algorithms, it does not rely on the assumption that the communication between UAVs is bidirectional. Thus, the topology of the inter-UAV information flow can be characterized by digraphs. To achieve coordination with weak connectivity, we design a switching law that orchestrates switching between jointly connected digraph topologies. In accordance with the law, the UAVs with a transmitter switch the topology of their coordination information flow. A Lyapunov analysis shows that a decentralized coordination controller steers coordination errors to a neighborhood of zero. Simulation results illustrate that the algorithm attains coordination objectives with significantly reduced inter-UAV communication compared to previous work. △ Less

Submitted 12 April, 2022; v1 submitted 29 September, 2021; originally announced September 2021.

arXiv:2108.08346 [pdf]

Permanent Magnet Linear Generator Design for Surface Riding Wave Energy Converters

Authors: Farid Naghavi, Shrikesh Sheshaprasad, Matthew Gardner, Aghamarshana Meduri, HeonYong Kang, Hamid Toliyat

Abstract: This paper describes the detailed analysis for the design of a linear generator developed for a Surface Riding Wave Energy Converter (SR-WEC), which was designed to improve energy capture over a wider range of sea states. The study starts with an analysis of the power take-off (PTO) control strategy to harness the maximum output power from given sea states. Passive, reactive, and discrete PTO cont… ▽ More This paper describes the detailed analysis for the design of a linear generator developed for a Surface Riding Wave Energy Converter (SR-WEC), which was designed to improve energy capture over a wider range of sea states. The study starts with an analysis of the power take-off (PTO) control strategy to harness the maximum output power from given sea states. Passive, reactive, and discrete PTO control are explored. For the random wave excitation and limited sliding distance of the generator, the discrete strategy provides the highest average power output. The paper discusses the sizing requirement for the linear generator. Based on the force and power rating of the system and the application requirements, a slotless permanent magnet tubular generator is designed for the wave energy converter. △ Less

Submitted 18 August, 2021; originally announced August 2021.

Comments: To be published in Energy Conversion Congress and Expo 2021

arXiv:2107.12003 [pdf, other]

Facetron: A Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations

Authors: Se-Yun Um, Jihyun Kim, Jihyun Lee, Hong-Goo Kang

Abstract: In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from li… ▽ More In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary depending on the input face images. We show the superiority of our proposed model over conventional methods in terms of objective and subjective evaluation results. Specifically, we evaluate the performances of linguistic features by measuring their accuracy on an automatic speech recognition task. In addition, we estimate speaker and gender similarity for multi-speaker and unseen conditions, respectively. We also evaluate the aturalness of the synthesized speech waveforms using a mean opinion score (MOS) test and non-intrusive objective speech quality assessment (NISQA).The demo samples of the proposed and other models are available at https://sam-0927.github.io/ △ Less

Submitted 15 March, 2023; v1 submitted 26 July, 2021; originally announced July 2021.

Comments: 5 pages (including references), 1 figure

arXiv:2104.02775 [pdf, other]

Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation

Authors: Jiyoung Lee, Soo-Whan Chung, Sunok Kim, Hong-Goo Kang, Kwanghoon Sohn

Abstract: In this paper, we address the problem of separating individual speech signals from videos using audio-visual neural processing. Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video. Thus, their performance heavily depends on the accuracy of audio-visual synchronization and the effectiveness of their representations. To… ▽ More In this paper, we address the problem of separating individual speech signals from videos using audio-visual neural processing. Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video. Thus, their performance heavily depends on the accuracy of audio-visual synchronization and the effectiveness of their representations. To overcome the frame discontinuity problem between two modalities due to transmission delay mismatch or jitter, we propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams. Given that the global term provides stability over a temporal sequence at the utterance-level, this resolves the label permutation problem characterized by inconsistent assignments. By extending the proposed cross-modal affinity on the complex network, we further improve the separation performance in the complex spectral domain. Experimental results verify that the proposed methods outperform conventional ones on various datasets, demonstrating their advantages in real-world scenarios. △ Less

Submitted 25 March, 2021; originally announced April 2021.

Comments: CVPR 2021. The first two authors contributed equally to this work. Project page: https://caffnet.github.io

arXiv:2103.12926 [pdf, other]

Beyond Visual Attractiveness: Physically Plausible Single Image HDR Reconstruction for Spherical Panoramas

Authors: Wei Wei, Li Guan, Yue Liu, Hao Kang, Haoxiang Li, Ying Wu, Gang Hua

Abstract: HDR reconstruction is an important task in computer vision with many industrial needs. The traditional approaches merge multiple exposure shots to generate HDRs that correspond to the physical quantity of illuminance of the scene. However, the tedious capturing process makes such multi-shot approaches inconvenient in practice. In contrast, recent single-shot methods predict a visually appealing HD… ▽ More HDR reconstruction is an important task in computer vision with many industrial needs. The traditional approaches merge multiple exposure shots to generate HDRs that correspond to the physical quantity of illuminance of the scene. However, the tedious capturing process makes such multi-shot approaches inconvenient in practice. In contrast, recent single-shot methods predict a visually appealing HDR from a single LDR image through deep learning. But it is not clear whether the previously mentioned physical properties would still hold, without training the network to explicitly model them. In this paper, we introduce the physical illuminance constraints to our single-shot HDR reconstruction framework, with a focus on spherical panoramas. By the proposed physical regularization, our method can generate HDRs which are not only visually appealing but also physically plausible. For evaluation, we collect a large dataset of LDR and HDR images with ground truth illuminance measures. Extensive experiments show that our HDR images not only maintain high visual quality but also top all baseline methods in illuminance prediction accuracy. △ Less

Submitted 23 March, 2021; originally announced March 2021.

arXiv:2102.03542

Continuous Monitoring of Blood Pressure with Evidential Regression

Authors: Hyeongju Kim, Woo Hyun Kang, Hyeonseung Lee, Nam Soo Kim

Abstract: Photoplethysmogram (PPG) signal-based blood pressure (BP) estimation is a promising candidate for modern BP measurements, as PPG signals can be easily obtained from wearable devices in a non-invasive manner, allowing quick BP measurement. However, the performance of existing machine learning-based BP measuring methods still fall behind some BP measurement guidelines and most of them provide only p… ▽ More Photoplethysmogram (PPG) signal-based blood pressure (BP) estimation is a promising candidate for modern BP measurements, as PPG signals can be easily obtained from wearable devices in a non-invasive manner, allowing quick BP measurement. However, the performance of existing machine learning-based BP measuring methods still fall behind some BP measurement guidelines and most of them provide only point estimates of systolic blood pressure (SBP) and diastolic blood pressure (DBP). In this paper, we present a cutting-edge method which is capable of continuously monitoring BP from the PPG signal and satisfies healthcare criteria such as the Association for the Advancement of Medical Instrumentation (AAMI) and the British Hypertension Society (BHS) standards. Furthermore, the proposed method provides the reliability of the predicted BP by estimating its uncertainty to help diagnose medical condition based on the model prediction. Experiments on the MIMIC II database verify the state-of-the-art performance of the proposed method under several metrics and its ability to accurately represent uncertainty in prediction. △ Less

Submitted 25 February, 2021; v1 submitted 6 February, 2021; originally announced February 2021.

Comments: We found some errors in the experimental configuration. We plan to revise the paper and republish it later

arXiv:2101.09864 [pdf, other]

Applications of Deep Learning in Fundus Images: A Review

Authors: Tao Li, Wang Bo, Chunyu Hu, Hong Kang, Hanruo Liu, Kai Wang, Huazhu Fu

Abstract: The use of fundus images for the early screening of eye diseases is of great clinical importance. Due to its powerful performance, deep learning is becoming more and more popular in related applications, such as lesion segmentation, biomarkers segmentation, disease diagnosis and image synthesis. Therefore, it is very necessary to summarize the recent developments in deep learning for fundus images… ▽ More The use of fundus images for the early screening of eye diseases is of great clinical importance. Due to its powerful performance, deep learning is becoming more and more popular in related applications, such as lesion segmentation, biomarkers segmentation, disease diagnosis and image synthesis. Therefore, it is very necessary to summarize the recent developments in deep learning for fundus images with a review paper. In this review, we introduce 143 application papers with a carefully designed hierarchy. Moreover, 33 publicly available datasets are presented. Summaries and analyses are provided for each task. Finally, limitations common to all tasks are revealed and possible solutions are given. We will also release and regularly update the state-of-the-art results and newly-released datasets at https://github.com/nkicsl/Fundus Review to adapt to the rapid development of this field. △ Less

Submitted 24 January, 2021; originally announced January 2021.

Journal ref: Medical Image Analysis 2021

arXiv:2012.12535 [pdf]

doi 10.3389/fmed.2021.746307

StainNet: a fast and robust stain normalization network

Authors: Hongtao Kang, Die Luo, Weihua Feng, Junbo Hu, Shaoqun Zeng, Tingwei Quan, Xiuli Liu

Abstract: Stain normalization often refers to transferring the color distribution of the source image to that of the target image and has been widely used in biomedical image analysis. The conventional stain normalization is regarded as constructing a pixel-by-pixel color map** model, which only depends on one reference image, and can not accurately achieve the style transformation between image datasets.… ▽ More Stain normalization often refers to transferring the color distribution of the source image to that of the target image and has been widely used in biomedical image analysis. The conventional stain normalization is regarded as constructing a pixel-by-pixel color map** model, which only depends on one reference image, and can not accurately achieve the style transformation between image datasets. In principle, this style transformation can be well solved by the deep learning-based methods due to its complicated network structure, whereas, its complicated structure results in the low computational efficiency and artifacts in the style transformation, which has restricted the practical application. Here, we use distillation learning to reduce the complexity of deep learning methods and a fast and robust network called StainNet to learn the color map** between the source image and target image. StainNet can learn the color map** relationship from a whole dataset and adjust the color value in a pixel-to-pixel manner. The pixel-to-pixel manner restricts the network size and avoids artifacts in the style transformation. The results on the cytopathology and histopathology datasets show that StainNet can achieve comparable performance to the deep learning-based methods. Computation results demonstrate StainNet is more than 40 times faster than StainGAN and can normalize a 100,000x100,000 whole slide image in 40 seconds. △ Less

Submitted 23 July, 2021; v1 submitted 23 December, 2020; originally announced December 2020.

Comments: 14 pages, 8 figures

Journal ref: Front. Med. 8:746307 (2021)

arXiv:2011.02168 [pdf, other]

Learning in your voice: Non-parallel voice conversion based on speaker consistency loss

Authors: Yoohwan Kwon, Soo-Whan Chung, Hee-Soo Heo, Hong-Goo Kang

Abstract: In this paper, we propose a novel voice conversion strategy to resolve the mismatch between the training and conversion scenarios when parallel speech corpus is unavailable for training. Based on auto-encoder and disentanglement frameworks, we design the proposed model to extract identity and content representations while reconstructing the input speech signal itself. Since we use other speaker's… ▽ More In this paper, we propose a novel voice conversion strategy to resolve the mismatch between the training and conversion scenarios when parallel speech corpus is unavailable for training. Based on auto-encoder and disentanglement frameworks, we design the proposed model to extract identity and content representations while reconstructing the input speech signal itself. Since we use other speaker's identity information in the training process, the training philosophy is naturally matched with the objective of voice conversion process. In addition, we effectively design the disentanglement framework to reliably preserve linguistic information and to enhance the quality of converted speech signals. The superiority of the proposed method is shown in subjective listening tests as well as objective measures. △ Less

Submitted 4 November, 2020; originally announced November 2020.

Comments: ICASSP 2021 submitted

arXiv:2010.11433 [pdf, other]

Unsupervised Representation Learning for Speaker Recognition via Contrastive Equilibrium Learning

Authors: Sung Hwan Mun, Woo Hyun Kang, Min Hyun Han, Nam Soo Kim

Abstract: In this paper, we propose a simple but powerful unsupervised learning method for speaker recognition, namely Contrastive Equilibrium Learning (CEL), which increases the uncertainty on nuisance factors latent in the embeddings by employing the uniformity loss. Also, to preserve speaker discriminability, a contrastive similarity loss function is used together. Experimental results showed that the pr… ▽ More In this paper, we propose a simple but powerful unsupervised learning method for speaker recognition, namely Contrastive Equilibrium Learning (CEL), which increases the uncertainty on nuisance factors latent in the embeddings by employing the uniformity loss. Also, to preserve speaker discriminability, a contrastive similarity loss function is used together. Experimental results showed that the proposed CEL significantly outperforms the state-of-the-art unsupervised speaker verification systems and the best performing model achieved 8.01% and 4.01% EER on VoxCeleb1 and VOiCES evaluation sets, respectively. On top of that, the performance of the supervised speaker embedding networks trained with initial parameters pre-trained via CEL showed better performance than those trained with randomly initialized parameters. △ Less

Submitted 22 October, 2020; originally announced October 2020.

Comments: 5 pages, 1 figure, 4 tables

arXiv:2010.11408 [pdf, ps, other]

Robust Text-Dependent Speaker Verification via Character-Level Information Preservation for the SdSV Challenge 2020

Authors: Sung Hwan Mun, Woo Hyun Kang, Min Hyun Han, Nam Soo Kim

Abstract: This paper describes our submission to Task 1 of the Short-duration Speaker Verification (SdSV) challenge 2020. Task 1 is a text-dependent speaker verification task, where both the speaker and phrase are required to be verified. The submitted systems were composed of TDNN-based and ResNet-based front-end architectures, in which the frame-level features were aggregated with various pooling methods… ▽ More This paper describes our submission to Task 1 of the Short-duration Speaker Verification (SdSV) challenge 2020. Task 1 is a text-dependent speaker verification task, where both the speaker and phrase are required to be verified. The submitted systems were composed of TDNN-based and ResNet-based front-end architectures, in which the frame-level features were aggregated with various pooling methods (e.g., statistical, self-attentive, ghostVLAD pooling). Although the conventional pooling methods provide embeddings with a sufficient amount of speaker-dependent information, our experiments show that these embeddings often lack phrase-dependent information. To mitigate this problem, we propose a new pooling and score compensation methods that leverage a CTC-based automatic speech recognition (ASR) model for taking the lexical content into account. Both methods showed improvement over the conventional techniques, and the best performance was achieved by fusing all the experimented systems, which showed 0.0785% MinDCF and 2.23% EER on the challenge's evaluation subset. △ Less

Submitted 21 October, 2020; originally announced October 2020.

Comments: Accepted in INTERSPEECH 2020

arXiv:2008.12912 [pdf, other]

Multi-Attention Based Ultra Lightweight Image Super-Resolution

Authors: Abdul Muqeet, Jiwon Hwang, Subin Yang, Jung Heum Kang, Yongwoo Kim, Sung-Ho Bae

Abstract: Lightweight image super-resolution (SR) networks have the utmost significance for real-world applications. There are several deep learning based SR methods with remarkable performance, but their memory and computational cost are hindrances in practical usage. To tackle this problem, we propose a Multi-Attentive Feature Fusion Super-Resolution Network (MAFFSRN). MAFFSRN consists of proposed feature… ▽ More Lightweight image super-resolution (SR) networks have the utmost significance for real-world applications. There are several deep learning based SR methods with remarkable performance, but their memory and computational cost are hindrances in practical usage. To tackle this problem, we propose a Multi-Attentive Feature Fusion Super-Resolution Network (MAFFSRN). MAFFSRN consists of proposed feature fusion groups (FFGs) that serve as a feature extraction block. Each FFG contains a stack of proposed multi-attention blocks (MAB) that are combined in a novel feature fusion structure. Further, the MAB with a cost-efficient attention mechanism (CEA) helps us to refine and extract the features using multiple attention mechanisms. The comprehensive experiments show the superiority of our model over the existing state-of-the-art. We participated in AIM 2020 efficient SR challenge with our MAFFSRN model and won 1st, 3rd, and 4th places in memory usage, floating-point operations (FLOPs) and number of parameters, respectively. △ Less

Submitted 21 September, 2020; v1 submitted 29 August, 2020; originally announced August 2020.

Comments: ECCVW AIM2020

arXiv:2008.03024 [pdf, other]

doi 10.1109/ACCESS.2020.3012893

Disentangled speaker and nuisance attribute embedding for robust speaker verification

Authors: Woo Hyun Kang, Sung Hwan Mun, Min Hyun Han, Nam Soo Kim

Abstract: Over the recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as in most of the classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples with different conditions (e.g., recording devices, emotional states)… ▽ More Over the recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as in most of the classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples with different conditions (e.g., recording devices, emotional states). In this paper, we propose a novel fully supervised training method for extracting a speaker embedding vector disentangled from the variability caused by the nuisance attributes. The proposed framework was compared with the conventional deep learning-based embedding methods using the RSR2015 and VoxCeleb1 dataset. Experimental results show that the proposed approach can extract speaker embeddings robust to channel and emotional variability. △ Less

Submitted 7 August, 2020; originally announced August 2020.

Comments: Accepted in IEEE Access

arXiv:2008.01698 [pdf, other]

MIRNet: Learning multiple identities representations in overlapped speech

Authors: Hyewon Han, Soo-Whan Chung, Hong-Goo Kang

Abstract: Many approaches can derive information about a single speaker's identity from the speech by learning to recognize consistent characteristics of acoustic parameters. However, it is challenging to determine identity information when there are multiple concurrent speakers in a given signal. In this paper, we propose a novel deep speaker representation strategy that can reliably extract multiple speak… ▽ More Many approaches can derive information about a single speaker's identity from the speech by learning to recognize consistent characteristics of acoustic parameters. However, it is challenging to determine identity information when there are multiple concurrent speakers in a given signal. In this paper, we propose a novel deep speaker representation strategy that can reliably extract multiple speaker identities from an overlapped speech. We design a network that can extract a high-level embedding that contains information about each speaker's identity from a given mixture. Unlike conventional approaches that need reference acoustic features for training, our proposed algorithm only requires the speaker identity labels of the overlapped speech segments. We demonstrate the effectiveness and usefulness of our algorithm in a speaker verification task and a speech separation system conditioned on the target speaker embeddings obtained through the proposed method. △ Less

Submitted 6 August, 2020; v1 submitted 4 August, 2020; originally announced August 2020.

Comments: Accepted in Interspeech 2020

arXiv:2008.01348 [pdf, other]

Intra-class variation reduction of speaker representation in disentanglement framework

Authors: Yoohwan Kwon, Soo-Whan Chung, Hong-Goo Kang

Abstract: In this paper, we propose an effective training strategy to ex-tract robust speaker representations from a speech signal. Oneof the key challenges in speaker recognition tasks is to learnlatent representations or embeddings containing solely speakercharacteristic information in order to be robust in terms of intra-speaker variations. By modifying the network architecture togenerate both speaker-re… ▽ More In this paper, we propose an effective training strategy to ex-tract robust speaker representations from a speech signal. Oneof the key challenges in speaker recognition tasks is to learnlatent representations or embeddings containing solely speakercharacteristic information in order to be robust in terms of intra-speaker variations. By modifying the network architecture togenerate both speaker-related and speaker-unrelated representa-tions, we exploit a learning criterion which minimizes the mu-tual information between these disentangled embeddings. Wealso introduce an identity change loss criterion which utilizes areconstruction error to different utterances spoken by the samespeaker. Since the proposed criteria reduce the variation ofspeaker characteristics caused by changes in background envi-ronment or spoken content, the resulting embeddings of eachspeaker become more consistent. The effectiveness of the pro-posed method is demonstrated through two tasks; disentangle-ment performance, and improvement of speaker recognition ac-curacy compared to the baseline model on a benchmark dataset,VoxCeleb1. Ablation studies also show the impact of each cri-terion on overall performance. △ Less

Submitted 4 August, 2020; originally announced August 2020.

Comments: Accepted for INTERSPEECH 2020

arXiv:2007.12903 [pdf, other]

doi 10.24963/ijcai.2020/518

Robust Front-End for Multi-Channel ASR using Flow-Based Density Estimation

Authors: Hyeongju Kim, Hyeonseung Lee, Woo Hyun Kang, Hyung Yong Kim, Nam Soo Kim

Abstract: For multi-channel speech recognition, speech enhancement techniques such as denoising or dereverberation are conventionally applied as a front-end processor. Deep learning-based front-ends using such techniques require aligned clean and noisy speech pairs which are generally obtained via data simulation. Recently, several joint optimization techniques have been proposed to train the front-end with… ▽ More For multi-channel speech recognition, speech enhancement techniques such as denoising or dereverberation are conventionally applied as a front-end processor. Deep learning-based front-ends using such techniques require aligned clean and noisy speech pairs which are generally obtained via data simulation. Recently, several joint optimization techniques have been proposed to train the front-end without parallel data within an end-to-end automatic speech recognition (ASR) scheme. However, the ASR objective is sub-optimal and insufficient for fully training the front-end, which still leaves room for improvement. In this paper, we propose a novel approach which incorporates flow-based density estimation for the robust front-end using non-parallel clean and noisy speech. Experimental results on the CHiME-4 dataset show that the proposed method outperforms the conventional techniques where the front-end is trained only with ASR objective. △ Less

Submitted 25 July, 2020; originally announced July 2020.

Comments: 7 pages, 3 figures

Journal ref: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, {IJCAI} 2020

Showing 1–50 of 72 results for author: Kang, H