-
Papez: Resource-Efficient Speech Separation with Auditory Working Memory
Authors:
Hyunseok Oh,
Juheon Yi,
Youngki Lee
Abstract:
Transformer-based models recently reached state-of-the-art single-channel speech separation accuracy; however, their extreme computational load makes it difficult to deploy them on resource-constrained mobile or IoT devices. We thus present Papez, a lightweight and computation-efficient single-channel speech separation model. Papez is based on three key techniques. First, we replace the inter-chunk Transformer with a small-sized auditory working memory. Second, we adaptively prune the input tokens that do not need further processing. Finally, we reduce the number of parameters through the recurrent transformer. Our extensive evaluation shows that Papez achieves the best resource and accuracy tradeoffs by a large margin. We publicly share our source code at https://github.com/snuhcs/Papez
Submitted 30 June, 2024;
originally announced July 2024.
-
EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech
Authors:
Deok-Hyeon Cho,
Hyung-Seok Oh,
Seung-Bin Kim,
Sang-Hoon Lee,
Seong-Whan Lee
Abstract:
Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use the arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics. The experimental results demonstrate the model's ability to control emotional style and intensity with high-quality expressive speech.
Submitted 11 June, 2024;
originally announced June 2024.
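The Cartesian-spherical transformation mentioned in the abstract can be illustrated with the textbook conversion below. This is only a sketch under stated assumptions: the axis ordering, the neutral reference point, and the sample values are hypothetical and are not taken from the paper.

```python
import math


def cartesian_to_spherical(x, y, z):
    """Convert a 3-D Cartesian point to (r, theta, phi).

    r     : distance from the origin (could be read as an intensity)
    theta : polar angle measured from the +z axis, in [0, pi]
    phi   : azimuth in the x-y plane, in (-pi, pi]
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        # Angles are undefined at the origin; return zeros by convention.
        return 0.0, 0.0, 0.0
    theta = math.acos(z / r)
    phi = math.atan2(y, x)
    return r, theta, phi


# Hypothetical arousal/valence/dominance pseudo-label and an assumed
# neutral centre; both are illustrative values, not data from the paper.
avd = (0.7, 0.6, 0.5)
neutral = (0.5, 0.5, 0.5)
offset = tuple(a - n for a, n in zip(avd, neutral))
r, theta, phi = cartesian_to_spherical(*offset)
```

In this reading, the radius would play the role of emotion intensity and the two angles the role of emotional style, but that mapping is an interpretation of the abstract rather than a confirmed detail.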
-
Machine Learning-Aided Cooperative Localization under Dense Urban Environment
Authors:
Hoon Lee,
Hong Ki Kim,
Seung Hyun Oh,
Sang Hyun Lee
Abstract:
Future wireless network technology provides automobiles with the connectivity feature to consolidate the concept of vehicular networks that collaborate on conducting cooperative driving tasks. The full potential of connected vehicles, which promises road safety and quality driving experience, can be leveraged if machine learning models guarantee robustness in performing core functions including localization and controls. Location awareness, in particular, lends itself to the deployment of location-specific services and the improvement of the operation performance. The localization entails direct communication to the network infrastructure, and the resulting centralized positioning solutions readily become intractable as the network scales up. As an alternative to the centralized solutions, this article addresses a decentralized principle of vehicular localization reinforced by machine learning techniques in dense urban environments with frequent inaccessibility to reliable measurements. As such, the collaboration of multiple vehicles enhances the positioning performance of machine learning approaches. A virtual testbed is developed to validate this machine learning model for real-map vehicular networks. Numerical results demonstrate universal feasibility of cooperative localization, in particular, for dense urban area configurations.
Submitted 5 April, 2024;
originally announced April 2024.
-
DurFlex-EVC: Duration-Flexible Emotional Voice Conversion with Parallel Generation
Authors:
Hyung-Seok Oh,
Sang-Hoon Lee,
Deok-Hyeon Cho,
Seong-Whan Lee
Abstract:
Emotional voice conversion (EVC) seeks to modify the emotional tone of a speaker's voice while preserving the original linguistic content and the speaker's unique vocal characteristics. Recent advancements in EVC have involved the simultaneous modeling of pitch and duration, utilizing the potential of sequence-to-sequence (seq2seq) models. To enhance reliability and efficiency in conversion, this study shifts focus towards parallel speech generation. We introduce Duration-Flexible EVC (DurFlex-EVC), which integrates a style autoencoder and unit aligner. Traditional models, while incorporating self-supervised learning (SSL) representations that contain both linguistic and paralinguistic information, have neglected this dual nature, leading to reduced controllability. Addressing this issue, we implement cross-attention to synchronize these representations with various emotions. Additionally, a style autoencoder is developed for the disentanglement and manipulation of style elements. The efficacy of our approach is validated through both subjective and objective evaluations, establishing its superiority over existing models in the field.
Submitted 7 March, 2024; v1 submitted 15 January, 2024;
originally announced January 2024.
-
Microphone Conversion: Mitigating Device Variability in Sound Event Classification
Authors:
Myeonghoon Ryu,
Hongseok Oh,
Suji Lee,
Han Park
Abstract:
In this study, we introduce a new augmentation technique to enhance the resilience of sound event classification (SEC) systems against device variability through the use of CycleGAN. We also present a unique dataset to evaluate this method. As SEC systems become increasingly common, it is crucial that they work well with audio from diverse recording devices. Our method addresses limited device diversity in training data by enabling unpaired training to transform input spectrograms as if they were recorded on a different device. Our experiments show that our approach outperforms existing methods in generalization by 5.2% - 11.5% in weighted F1 score. Additionally, it surpasses the current methods in adaptability across diverse recording devices by achieving a 6.5% - 12.8% improvement in weighted F1 score.
Submitted 12 January, 2024;
originally announced January 2024.
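For readers unfamiliar with the metric quoted in the abstract, the sketch below computes a weighted F1 score in the usual way: the per-class F1 is averaged with weights proportional to each class's share of the true labels. The function name and the toy labels are illustrative assumptions, not material from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label in sorted(support):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += support[label] / total * f1
    return score


# Toy sound-event labels (hypothetical).
y_true = ["dog", "dog", "siren", "siren"]
y_pred = ["dog", "siren", "siren", "siren"]
score = weighted_f1(y_true, y_pred)
```

Weighting by support keeps frequent classes from being under-represented in the average, which matters when event classes are imbalanced.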
-
Adversarial Denoising Diffusion Model for Unsupervised Anomaly Detection
Authors:
Jongmin Yu,
Hyeontaek Oh,
Jinhong Yang
Abstract:
In this paper, we propose the Adversarial Denoising Diffusion Model (ADDM). The ADDM is based on the Denoising Diffusion Probabilistic Model (DDPM) but complementarily trained by adversarial learning. The proposed adversarial learning is achieved by classifying model-based denoised samples and samples to which random Gaussian noise is added to a specific sampling step. With the addition of explicit adversarial learning on data samples, ADDM can learn the semantic characteristics of the data more robustly during training, achieving a similar data sampling performance with far fewer sampling steps than DDPM. We apply ADDM to unsupervised anomaly detection in MRI images. Experimental results show that the proposed ADDM outperformed existing generative model-based unsupervised anomaly detection methods. In particular, compared to other DDPM-based anomaly detection methods, the proposed ADDM shows better performance with the same number of sampling steps and similar performance with 50% fewer sampling steps.
Submitted 7 December, 2023;
originally announced December 2023.
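As background for the phrase "samples to which random Gaussian noise is added to a specific sampling step", the standard DDPM forward process is reproduced below. This is the general DDPM formulation, with the usual noise-schedule symbols $\beta_s$ and $\bar{\alpha}_t$; it is not a statement of the ADDM training objective, which the abstract does not spell out.

```latex
\begin{align}
  q(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\right),
  \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s), \\
  x_t &= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I).
\end{align}
```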
-
Reachable Set-based Path Planning for Automated Vertical Parking System
Authors:
In Hyuk Oh,
Ju Won Seo,
Jin Sung Kim,
Chung Choo Chung
Abstract:
This paper proposes a local path planning method with a reachable set for Automated vertical Parking Systems (APS). First, given a parking lot layout with a goal position, we define an intermediate pose for the APS to accomplish reverse parking with a single maneuver, i.e., without changing the gear shift. Then, we introduce a reachable set, which is a set of points consisting of the grid points of all possible intermediate poses. Once the APS approaches the goal position, it must select an intermediate pose in the reachable set. A minimization problem was formulated and solved to choose the intermediate pose. We tested various scenarios with different parking lot conditions. We used the Hybrid-A* algorithm for the global path planning to move the vehicle from the starting pose to the intermediate pose and utilized clothoid-based local path planning to move from the intermediate pose to the goal pose. Additionally, we designed a controller to follow the generated path and validated its tracking performance. It was confirmed that the root-mean-square tracking error was bounded within 0.06 m for the lateral position and within 0.01 rad for orientation.
Submitted 11 August, 2023;
originally announced August 2023.
-
DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training
Authors:
Hyung-Seok Oh,
Sang-Hoon Lee,
Seong-Whan Lee
Abstract:
Expressive text-to-speech systems have undergone significant advancements owing to prosody modeling, but conventional methods can still be improved. Traditional approaches have relied on the autoregressive method to predict the quantized prosody vector; however, it suffers from the issues of long-term dependency and slow inference. This study proposes a novel approach called DiffProsody in which expressive speech is synthesized using a diffusion-based latent prosody generator and prosody conditional adversarial training. Our findings confirm the effectiveness of our prosody generator in generating a prosody vector. Furthermore, our prosody conditional discriminator significantly improves the quality of the generated speech by accurately emulating prosody. We use denoising diffusion generative adversarial networks to improve the prosody generation speed. Consequently, DiffProsody is capable of generating prosody 16 times faster than the conventional diffusion model. The superior performance of our proposed method has been demonstrated via experiments.
Submitted 31 July, 2023;
originally announced July 2023.
-
HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer
Authors:
Sang-Hoon Lee,
Ha-Yeong Choi,
Hyung-Seok Oh,
Seong-Whan Lee
Abstract:
Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at https://hiervst.github.io/.
Submitted 30 July, 2023;
originally announced July 2023.
-
Deep Unsupervised Domain Adaptation: A Review of Recent Advances and Perspectives
Authors:
Xiaofeng Liu,
Chaehwa Yoo,
Fangxu Xing,
Hyejin Oh,
Georges El Fakhri,
Je-Won Kang,
Jonghye Woo
Abstract:
Deep learning has become the method of choice to tackle real-world problems in different domains, partly because of its ability to learn from data and achieve impressive performance on a wide range of applications. However, its success usually relies on two assumptions: (i) vast troves of labeled datasets are required for accurate model fitting, and (ii) training and testing data are independent and identically distributed. Its performance on unseen target domains, thus, is not guaranteed, especially when encountering out-of-distribution data at the adaptation stage. The performance drop on data in a target domain is a critical problem in deploying deep neural networks that are successfully trained on data in a source domain. Unsupervised domain adaptation (UDA) is proposed to counter this, by leveraging both labeled source domain data and unlabeled target domain data to carry out various tasks in the target domain. UDA has yielded promising results on natural image processing, video analysis, natural language processing, time-series data analysis, medical image analysis, etc. Because UDA is a rapidly evolving topic, this review provides a systematic comparison of its methods and applications. In addition, the connection of UDA with its closely related tasks, e.g., domain generalization and out-of-distribution detection, is discussed. Furthermore, deficiencies in current methods and possible promising directions are highlighted.
Submitted 15 August, 2022;
originally announced August 2022.
-
Model-plant mismatch learning offset-free model predictive control
Authors:
Sang Hwan Son,
Jong Woo Kim,
Tae Hoon Oh,
Jong Min Lee
Abstract:
We propose model-plant mismatch learning offset-free model predictive control (MPC), which learns and applies the intrinsic model-plant mismatch, to effectively exploit the advantages of model-based and data-driven control strategies and overcome the limitations of each approach. In this study, the model-plant mismatch map on the steady-state manifold in the controlled variable space is approximated via a general regression neural network from the steady-state data for each setpoint. Though the learned model-plant mismatch map can provide the information at the equilibrium point (i.e., setpoint), it cannot provide model-plant mismatch information during the transient state. Moreover, the intrinsic model-plant mismatch can vary due to changes in system characteristics during operation. Therefore, we additionally apply a supplementary disturbance variable which is updated from the disturbance estimator based on the nominal offset-free MPC scheme. Then, the combined disturbance signal is applied to the target problem and finite-horizon optimal control problem of offset-free MPC to improve the prediction accuracy and closed-loop performance of the controller. In this way, we can exploit both the learned model-plant mismatch information and the stabilizing property of the nominal disturbance estimator approach. The closed-loop simulation results demonstrate that the developed scheme can properly learn the intrinsic model-plant mismatch and efficiently improve the model-plant mismatch compensating performance in offset-free MPC. Moreover, we examine the robust asymptotic stability of the developed offset-free MPC scheme, which is known to be difficult to analyze in nominal offset-free MPC, by exploiting the learned model-plant mismatch information.
Submitted 13 December, 2020; v1 submitted 4 December, 2020;
originally announced December 2020.
-
Unit Commitment Considering the Impact of Deep Cycling
Authors:
HyungSeon Oh
Abstract:
Wind energy has been integrated into the power system with the hope that it improves energy efficiency and decreases greenhouse gas emissions. However, several studies worldwide suggest that the result has been the opposite of what was hoped, mainly because of the negative correlation between wind availability and load. In this situation, coal power plants are forced to cycle even though they are not designed to do so. To prevent this unwanted result, a unit commitment decision should include the use of fuel and the emission rate during the ramp up/down process. This paper proposes a new unit commitment decision process to accommodate the economic and the environmental costs associated with the ramping process. The costs are, in general, not convex because there is a positive cost if a generator output changes significantly, regardless of direction. As a result, the problem might be nonconvex. A piecewise linear cost curve is introduced to model the impact of ramping processes. With the curve, a convex linear program is formulated, and the impact of a governmental policy is discussed.
Submitted 30 May, 2020;
originally announced June 2020.
-
Analytical solution to swing equations in power grids
Authors:
HyungSeon Oh
Abstract:
Objective: To derive a closed-form analytical solution to the swing equation describing the power system dynamics, which is a nonlinear second order differential equation. Existing challenges: No analytical solution to the swing equation has been identified, due to the complex nature of power systems. Two major approaches are pursued for stability assessments on systems: (1) computationally simple models based on physically unacceptable assumptions, and (2) digital simulations with high computational costs. Motivation: The motion of the rotor angle that the swing equation describes is a vector function. Often, a simple form of the physical laws is revealed by coordinate transformation. Methods: The study included the formulation of the swing equation in the Cartesian coordinate system, which is different from conventional approaches that describe the equation in the polar coordinate system. Based on the properties and operational conditions of electric power grids referred to in the literature, we identified the swing equation in the Cartesian coordinate system and derived an analytical solution within a validity region. Results: The estimated results from the analytical solution derived in this study agree with the results using conventional methods, which indicates the derived analytical solution is correct. Conclusion: An analytical solution to the swing equation is derived without unphysical assumptions, and the closed-form solution correctly estimates the dynamics after a fault occurs.
Submitted 25 November, 2019;
originally announced November 2019.
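For orientation, a common textbook form of the swing equation for a single machine connected to an infinite bus is given below. The symbols (inertia constant $H$, synchronous speed $\omega_s$, damping coefficient $D$, mechanical power $P_m$, maximum electrical power $P_{\max}$, rotor angle $\delta$) follow the usual convention and are assumptions here; the abstract does not state the exact form or the Cartesian reformulation used in the paper.

```latex
\begin{equation}
  \frac{2H}{\omega_s}\,\frac{d^2\delta}{dt^2} + D\,\frac{d\delta}{dt}
  = P_m - P_{\max}\sin\delta
\end{equation}
```

The $\sin\delta$ term is what makes the equation nonlinear and second-order in $\delta$, which is why closed-form solutions are hard to obtain in the polar (angle-based) description.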
-
Ensemble Patch Transformation: A New Tool for Signal Decomposition
Authors:
Donghoh Kim,
Guebin Choi,
Hee-Seok Oh
Abstract:
This paper considers the problem of signal decomposition and data visualization. For this purpose, we introduce a new multiscale transform, termed 'ensemble patch transformation', that enhances identification of local characteristics embedded in a signal and provides multiscale visualization according to different levels; hence, it is useful for data analysis and signal decomposition. In the literature, there are data-adaptive decomposition methods such as empirical mode decomposition (EMD) by Huang et al. (1998). Along the same lines as EMD, we propose a new decomposition algorithm that extracts meaningful components from a signal belonging to a larger class of signals than previous methods can handle. Some theoretical properties of the proposed algorithm are investigated. To evaluate the proposed method, we analyze several synthetic examples and a real-world signal.
Submitted 7 April, 2019;
originally announced April 2019.
-
Situational Awareness with PMUs and SCADA
Authors:
HyungSeon Oh
Abstract:
Phasor measurement units (PMUs) are integrated into transmission networks under the smart grid umbrella. The observability of PMUs is geographically limited due to their high integration cost. The measurements of PMUs can be complemented by those from widely installed supervisory control and data acquisition (SCADA) systems to enhance situational awareness. This paper proposes a new state estimation method that simultaneously integrates both sets of measurements and shows outstanding performance.
Submitted 9 June, 2017; v1 submitted 2 June, 2017;
originally announced June 2017.