Search | arXiv e-print repository

Papez: Resource-Efficient Speech Separation with Auditory Working Memory

Authors: Hyunseok Oh, Juheon Yi, Youngki Lee

Abstract: Transformer-based models recently reached state-of-the-art single-channel speech separation accuracy; however, their extreme computational load makes it difficult to deploy them on resource-constrained mobile or IoT devices. We thus present Papez, a lightweight and computation-efficient single-channel speech separation model. Papez is based on three key techniques. First, we replace the inter-chunk Transformer with a small auditory working memory. Second, we adaptively prune the input tokens that do not need further processing. Finally, we reduce the number of parameters through the recurrent transformer. Our extensive evaluation shows that Papez achieves the best resource and accuracy tradeoffs by a large margin. We publicly share our source code at https://github.com/snuhcs/Papez.

Submitted 30 June, 2024; originally announced July 2024.

Comments: 5 pages. Accepted by ICASSP 2023

arXiv:2406.07803 [pdf, other]

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

Authors: Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee

Abstract: Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use the arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics. The experimental results demonstrate the model's ability to control emotional style and intensity with high-quality expressive speech.

Submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted at INTERSPEECH 2024

arXiv:2404.04096 [pdf, other]

Machine Learning-Aided Cooperative Localization under Dense Urban Environment

Authors: Hoon Lee, Hong Ki Kim, Seung Hyun Oh, Sang Hyun Lee

Abstract: Future wireless network technology provides automobiles with the connectivity feature to consolidate the concept of vehicular networks that collaborate on conducting cooperative driving tasks. The full potential of connected vehicles, which promises road safety and quality driving experience, can be leveraged if machine learning models guarantee robustness in performing core functions including localization and controls. Location awareness, in particular, lends itself to the deployment of location-specific services and the improvement of the operation performance. The localization entails direct communication to the network infrastructure, and the resulting centralized positioning solutions readily become intractable as the network scales up. As an alternative to the centralized solutions, this article addresses the decentralized principle of vehicular localization reinforced by machine learning techniques in dense urban environments with frequent inaccessibility to reliable measurement. As such, the collaboration of multiple vehicles enhances the positioning performance of machine learning approaches. A virtual testbed is developed to validate this machine learning model for real-map vehicular networks. Numerical results demonstrate the universal feasibility of cooperative localization, in particular, for dense urban area configurations.

Submitted 5 April, 2024; originally announced April 2024.

arXiv:2401.08095 [pdf, other]

DurFlex-EVC: Duration-Flexible Emotional Voice Conversion with Parallel Generation

Authors: Hyung-Seok Oh, Sang-Hoon Lee, Deok-Hyeon Cho, Seong-Whan Lee

Abstract: Emotional voice conversion (EVC) seeks to modify the emotional tone of a speaker's voice while preserving the original linguistic content and the speaker's unique vocal characteristics. Recent advancements in EVC have involved the simultaneous modeling of pitch and duration, utilizing the potential of sequence-to-sequence (seq2seq) models. To enhance reliability and efficiency in conversion, this study shifts focus towards parallel speech generation. We introduce Duration-Flexible EVC (DurFlex-EVC), which integrates a style autoencoder and unit aligner. Traditional models, while incorporating self-supervised learning (SSL) representations that contain both linguistic and paralinguistic information, have neglected this dual nature, leading to reduced controllability. Addressing this issue, we implement cross-attention to synchronize these representations with various emotions. Additionally, a style autoencoder is developed for the disentanglement and manipulation of style elements. The efficacy of our approach is validated through both subjective and objective evaluations, establishing its superiority over existing models in the field.

Submitted 7 March, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

Comments: 13 pages, 9 figures, 8 tables

arXiv:2401.06913 [pdf, other]

Microphone Conversion: Mitigating Device Variability in Sound Event Classification

Authors: Myeonghoon Ryu, Hongseok Oh, Suji Lee, Han Park

Abstract: In this study, we introduce a new augmentation technique to enhance the resilience of sound event classification (SEC) systems against device variability through the use of CycleGAN. We also present a unique dataset to evaluate this method. As SEC systems become increasingly common, it is crucial that they work well with audio from diverse recording devices. Our method addresses limited device diversity in training data by enabling unpaired training to transform input spectrograms as if they were recorded on a different device. Our experiments show that our approach outperforms existing methods in generalization by 5.2% - 11.5% in weighted F1 score. Additionally, it surpasses the current methods in adaptability across diverse recording devices by achieving a 6.5% - 12.8% improvement in weighted F1 score.

Submitted 12 January, 2024; originally announced January 2024.

Comments: Accepted to ICASSP 2024

arXiv:2312.04382 [pdf, other]

Adversarial Denoising Diffusion Model for Unsupervised Anomaly Detection

Authors: Jongmin Yu, Hyeontaek Oh, Jinhong Yang

Abstract: In this paper, we propose the Adversarial Denoising Diffusion Model (ADDM). The ADDM is based on the Denoising Diffusion Probabilistic Model (DDPM) but complementarily trained by adversarial learning. The proposed adversarial learning is achieved by classifying model-based denoised samples and samples to which random Gaussian noise is added to a specific sampling step. With the addition of explicit adversarial learning on data samples, ADDM can learn the semantic characteristics of the data more robustly during training, which achieves a similar data sampling performance with far fewer sampling steps than DDPM. We apply ADDM to unsupervised anomaly detection in MRI images. Experimental results show that the proposed ADDM outperformed existing generative model-based unsupervised anomaly detection methods. In particular, compared to other DDPM-based anomaly detection methods, the proposed ADDM shows better performance with the same number of sampling steps and similar performance with 50% fewer sampling steps.

Submitted 7 December, 2023; originally announced December 2023.

Comments: Accepted for the poster session of the DGM4H workshop at NeurIPS 2023

arXiv:2308.05992 [pdf, other]

Reachable Set-based Path Planning for Automated Vertical Parking System

Authors: In Hyuk Oh, Ju Won Seo, Jin Sung Kim, Chung Choo Chung

Abstract: This paper proposes a local path planning method with a reachable set for Automated vertical Parking Systems (APS). First, given a parking lot layout with a goal position, we define an intermediate pose for the APS to accomplish reverse parking with a single maneuver, i.e., without changing the gear shift. Then, we introduce a reachable set, which is a set of points consisting of the grid points of all possible intermediate poses. Once the APS approaches the goal position, it must select an intermediate pose in the reachable set. A minimization problem was formulated and solved to choose the intermediate pose. We tested various scenarios with different parking lot conditions. We used the Hybrid-A* algorithm for the global path planning to move the vehicle from the starting pose to the intermediate pose and utilized clothoid-based local path planning to move from the intermediate pose to the goal pose. Additionally, we designed a controller to follow the generated path and validated its tracking performance. It was confirmed that the root mean square tracking error was bounded within 0.06 m for the lateral position and within 0.01 rad for orientation.

Submitted 11 August, 2023; originally announced August 2023.

Comments: 8 pages, 10 figures, conference. This is the Accepted Manuscript version of an article accepted for publication in [IEEE International Conference on Intelligent Transportation Systems ITSC 2023]. IOP Publishing Ltd is not responsible for any errors or omissions in this version of the manuscript or any version derived from it. No information about DOI has been posted yet

arXiv:2307.16549 [pdf, other]

DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training

Authors: Hyung-Seok Oh, Sang-Hoon Lee, Seong-Whan Lee

Abstract: Expressive text-to-speech systems have undergone significant advancements owing to prosody modeling, but conventional methods can still be improved. Traditional approaches have relied on the autoregressive method to predict the quantized prosody vector; however, it suffers from the issues of long-term dependency and slow inference. This study proposes a novel approach called DiffProsody in which expressive speech is synthesized using a diffusion-based latent prosody generator and prosody conditional adversarial training. Our findings confirm the effectiveness of our prosody generator in generating a prosody vector. Furthermore, our prosody conditional discriminator significantly improves the quality of the generated speech by accurately emulating prosody. We use denoising diffusion generative adversarial networks to improve the prosody generation speed. Consequently, DiffProsody is capable of generating prosody 16 times faster than the conventional diffusion model. The superior performance of our proposed method has been demonstrated via experiments.

Submitted 31 July, 2023; originally announced July 2023.

Comments: 10 pages, 8 figures, 5 tables, under review

arXiv:2307.16171 [pdf, other]

HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer

Authors: Sang-Hoon Lee, Ha-Yeong Choi, Hyung-Seok Oh, Seong-Whan Lee

Abstract: Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at https://hiervst.github.io/.

Submitted 30 July, 2023; originally announced July 2023.

Comments: INTERSPEECH 2023 (Oral)

arXiv:2208.07422 [pdf, other]

Deep Unsupervised Domain Adaptation: A Review of Recent Advances and Perspectives

Authors: Xiaofeng Liu, Chaehwa Yoo, Fangxu Xing, Hyejin Oh, Georges El Fakhri, Je-Won Kang, Jonghye Woo

Abstract: Deep learning has become the method of choice to tackle real-world problems in different domains, partly because of its ability to learn from data and achieve impressive performance on a wide range of applications. However, its success usually relies on two assumptions: (i) vast troves of labeled datasets are required for accurate model fitting, and (ii) training and testing data are independent and identically distributed. Its performance on unseen target domains, thus, is not guaranteed, especially when encountering out-of-distribution data at the adaptation stage. The performance drop on data in a target domain is a critical problem in deploying deep neural networks that are successfully trained on data in a source domain. Unsupervised domain adaptation (UDA) is proposed to counter this, by leveraging both labeled source domain data and unlabeled target domain data to carry out various tasks in the target domain. UDA has yielded promising results on natural image processing, video analysis, natural language processing, time-series data analysis, medical image analysis, etc. In this review, as a rapidly evolving topic, we provide a systematic comparison of its methods and applications. In addition, the connection of UDA with its closely related tasks, e.g., domain generalization and out-of-distribution detection, has also been discussed. Furthermore, deficiencies in current methods and possible promising directions are highlighted.

Submitted 15 August, 2022; originally announced August 2022.

Comments: APSIPA Transactions on Signal and Information Processing

arXiv:2012.02753 [pdf, other]

Model-plant mismatch learning offset-free model predictive control

Authors: Sang Hwan Son, Jong Woo Kim, Tae Hoon Oh, Jong Min Lee

Abstract: We propose model-plant mismatch learning offset-free model predictive control (MPC), which learns and applies the intrinsic model-plant mismatch, to effectively exploit the advantages of model-based and data-driven control strategies and overcome the limitations of each approach. In this study, the model-plant mismatch map on the steady-state manifold in the controlled variable space is approximated via a general regression neural network from the steady-state data for each setpoint. Though the learned model-plant mismatch map can provide the information at the equilibrium point (i.e., setpoint), it cannot provide model-plant mismatch information during the transient state. Moreover, the intrinsic model-plant mismatch can vary due to system characteristics changes during operation. Therefore, we additionally apply a supplementary disturbance variable which is updated from the disturbance estimator based on the nominal offset-free MPC scheme. Then, the combined disturbance signal is applied to the target problem and finite-horizon optimal control problem of offset-free MPC to improve the prediction accuracy and closed-loop performance of the controller. By this, we can exploit both the learned model-plant mismatch information and the stabilizing property of the nominal disturbance estimator approach. The closed-loop simulation results demonstrate that the developed scheme can properly learn the intrinsic model-plant mismatch and efficiently improve the model-plant mismatch compensating performance in offset-free MPC. Moreover, we examine the robust asymptotic stability of the developed offset-free MPC scheme, which is known to be difficult to analyze in nominal offset-free MPC, by exploiting the learned model-plant mismatch information.

Submitted 13 December, 2020; v1 submitted 4 December, 2020; originally announced December 2020.

arXiv:2006.00284 [pdf]

Unit Commitment Considering the Impact of Deep Cycling

Authors: HyungSeon Oh

Abstract: Wind energy has been integrated into the power system with the hope that it improves energy efficiency and decreases greenhouse gas emissions. However, several studies worldwide suggest that the result has been the opposite of what was hoped, mainly because of the negative correlation between wind availability and load. In this situation, coal power plants are forced to cycle even though they are not designed to do so. To prevent this unwanted result from occurring, a unit commitment decision should include the use of fuel and the emission rate during the ramp up/down process. This paper proposes a new unit commitment decision process to accommodate the economic and the environmental costs associated with the ramping process. The costs are, in general, not convex because there is positive cost if a generator output changes significantly regardless of direction. As a result, the problem might be nonconvex. A piece-wise linear cost curve is introduced to model the impact of ramping processes. With the curve, a convex linear program is formulated, and the impact of a governmental policy is discussed.

Submitted 30 May, 2020; originally announced June 2020.

Comments: 25 pages, 10 figures

arXiv:1911.11113 [pdf]

doi 10.1371/journal.pone.0225097

Analytical solution to swing equations in power grids

Authors: HyungSeon Oh

Abstract: Objective: To derive a closed-form analytical solution to the swing equation describing the power system dynamics, which is a nonlinear second order differential equation. Existing challenges: No analytical solution to the swing equation has been identified, due to the complex nature of power systems. Two major approaches are pursued for stability assessments on systems: (1) computationally simple models based on physically unacceptable assumptions, and (2) digital simulations with high computational costs. Motivation: The motion of the rotor angle that the swing equation describes is a vector function. Often, a simple form of the physical laws is revealed by coordinate transformation. Methods: The study included the formulation of the swing equation in the Cartesian coordinate system, which is different from conventional approaches that describe the equation in the polar coordinate system. Based on the properties and operational conditions of electric power grids referred to in the literature, we identified the swing equation in the Cartesian coordinate system and derived an analytical solution within a validity region. Results: The estimated results from the analytical solution derived in this study agree with the results using conventional methods, which indicates the derived analytical solution is correct. Conclusion: An analytical solution to the swing equation is derived without unphysical assumptions, and the closed-form solution correctly estimates the dynamics after a fault occurs.

Submitted 25 November, 2019; originally announced November 2019.

Comments: Corrected version of the published paper at PLoS ONE

Journal ref: published, 2019

arXiv:1904.03643 [pdf, other]

Ensemble Patch Transformation: A New Tool for Signal Decomposition

Authors: Donghoh Kim, Guebin Choi, Hee-Seok Oh

Abstract: This paper considers the problem of signal decomposition and data visualization. For this purpose, we introduce a new multiscale transform, termed 'ensemble patch transformation', that enhances identification of local characteristics embedded in a signal and provides multiscale visualization according to different levels; hence, it is useful for data analysis and signal decomposition. In the literature, there are data-adaptive decomposition methods such as empirical mode decomposition (EMD) by Huang et al. (1998). Along the same line of EMD, we propose a new decomposition algorithm that extracts meaningful components from a signal that belongs to a large class of signals, compared to the previous methods. Some theoretical properties of the proposed algorithm are investigated. To evaluate the proposed method, we analyze several synthetic examples and a real-world signal.

Submitted 7 April, 2019; originally announced April 2019.

Comments: 32 pages with 24 figures

arXiv:1706.00795 [pdf]

Situational Awareness with PMUs and SCADA

Authors: HyungSeon Oh

Abstract: Phasor measurement units (PMUs) are integrated into transmission networks under the smart grid umbrella. The observability of PMUs is geographically limited due to their high integration cost. The measurements of PMUs can be complemented by those from widely installed supervisory control and data acquisition (SCADA) systems to enhance situational awareness. This paper proposes a new state estimation method that simultaneously integrates both measurements and shows outstanding performance.

Submitted 9 June, 2017; v1 submitted 2 June, 2017; originally announced June 2017.

Comments: 8 pages

Showing 1–15 of 15 results for author: Oh, H