Search | arXiv e-print repository

arXiv:2406.07803 [pdf, other]

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

Authors: Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee

Abstract: Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use the arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics. The experimental results demonstrate the model ability to control emotional style and intensity with high-quality expressive speech.

Submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted at INTERSPEECH 2024

arXiv:2401.08095 [pdf, other]

DurFlex-EVC: Duration-Flexible Emotional Voice Conversion with Parallel Generation

Authors: Hyung-Seok Oh, Sang-Hoon Lee, Deok-Hyeon Cho, Seong-Whan Lee

Abstract: Emotional voice conversion (EVC) seeks to modify the emotional tone of a speaker's voice while preserving the original linguistic content and the speaker's unique vocal characteristics. Recent advancements in EVC have involved the simultaneous modeling of pitch and duration, utilizing the potential of sequence-to-sequence (seq2seq) models. To enhance reliability and efficiency in conversion, this study shifts focus towards parallel speech generation. We introduce Duration-Flexible EVC (DurFlex-EVC), which integrates a style autoencoder and unit aligner. Traditional models, while incorporating self-supervised learning (SSL) representations that contain both linguistic and paralinguistic information, have neglected this dual nature, leading to reduced controllability. Addressing this issue, we implement cross-attention to synchronize these representations with various emotions. Additionally, a style autoencoder is developed for the disentanglement and manipulation of style elements. The efficacy of our approach is validated through both subjective and objective evaluations, establishing its superiority over existing models in the field.

Submitted 7 March, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

Comments: 13 pages, 9 figures, 8 tables

arXiv:2007.01524 [pdf, other]

Domain Adaptation without Source Data

Authors: Youngeun Kim, Donghyeon Cho, Kyeongtak Han, Priyadarshini Panda, Sungeun Hong

Abstract: Domain adaptation assumes that samples from source and target domains are freely accessible during a training phase. However, such an assumption is rarely plausible in the real-world and possibly causes data-privacy issues, especially when the label of the source domain can be a sensitive attribute as an identifier. To avoid accessing source data that may contain sensitive information, we introduce Source data-Free Domain Adaptation (SFDA). Our key idea is to leverage a pre-trained model from the source domain and progressively update the target model in a self-learning manner. We observe that target samples with lower self-entropy measured by the pre-trained source model are more likely to be classified correctly. From this, we select the reliable samples with the self-entropy criterion and define these as class prototypes. We then assign pseudo labels for every target sample based on the similarity score with class prototypes. Furthermore, to reduce the uncertainty from the pseudo labeling process, we propose set-to-set distance-based filtering which does not require any tunable hyperparameters. Finally, we train the target model with the filtered pseudo labels with regularization from the pre-trained source model. Surprisingly, without direct usage of labeled source samples, our PrDA outperforms conventional domain adaptation methods on benchmark datasets. Our code is publicly available at https://github.com/youngryan1993/SFDA-SourceFreeDA

Submitted 30 August, 2021; v1 submitted 3 July, 2020; originally announced July 2020.

Comments: 13 pages

arXiv:1912.00374 [pdf]

Task Scheduling of Multiple Agile Satellites with Transition Time and Stereo Imaging Constraints

Authors: Junhong Kim, Doo-Hyun Cho, Jaemyung Ahn, Han-Lim Choi

Abstract: This paper proposes a framework for scheduling the observation and download tasks of multiple agile satellites with practical considerations such as attitude transition time, onboard data capacity, and stereoscopic image acquisition. A mixed integer linear programming (MILP) formulation for optimal scheduling that can address these practical considerations is introduced. A heuristic algorithm to obtain a near-optimal solution of the formulated MILP based on the time windows pruning procedure is proposed. A comprehensive case study demonstrating the validity of the proposed formulation and heuristic is presented.

Submitted 1 December, 2019; originally announced December 2019.

arXiv:1906.07851 [pdf, other]

Key Instance Selection for Unsupervised Video Object Segmentation

Authors: Donghyeon Cho, Sungeun Hong, Sungil Kang, Jiwon Kim

Abstract: This paper proposes key instance selection based on video saliency covering objectness and dynamics for unsupervised video object segmentation (UVOS). Our method takes frames sequentially and extracts object proposals with corresponding masks for each frame. We link objects according to their similarity until the M-th frame and then assign them unique IDs (i.e., instances). Similarity measure takes into account multiple properties such as ReID descriptor, expected trajectory, and semantic co-segmentation result. After M-th frame, we select K IDs based on video saliency and frequency of appearance; then only these key IDs are tracked through the remaining frames. Thanks to these technical contributions, our results are ranked third on the leaderboard of UVOS DAVIS challenge.

Submitted 26 July, 2019; v1 submitted 18 June, 2019; originally announced June 2019.

Comments: Ranked 3rd in 'Unsupervised DAVIS Challenge' (CVPR 2019)

Showing 1–5 of 5 results for author: Cho, D