Search | arXiv e-print repository

AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion

Authors: Haeyun Choi, Jio Gim, Yuho Lee, Youngin Kim, Young-Joo Suh

Abstract: This paper proposes a simple and robust zero-shot voice conversion system with a cycle structure and mel-spectrogram pre-processing. Previous works suffer from information loss and poor synthesis quality due to their reliance on a carefully designed bottleneck structure. Moreover, models relying solely on self-reconstruction loss struggled with reproducing different speakers' voices. To address these issues, we suggested a cycle-consistency loss that considers conversion back and forth between target and source speakers. Additionally, stacked random-shuffled mel-spectrograms and a label smoothing method are utilized during speaker encoder training to extract a time-independent global speaker representation from speech, which is the key to a zero-shot conversion. Our model outperforms existing state-of-the-art results in both subjective and objective evaluations. Furthermore, it facilitates cross-lingual voice conversions and enhances the quality of synthesized speech.

Submitted 10 October, 2023; originally announced October 2023.

arXiv:2011.01174 [pdf, other]

doi 10.1109/ACCESS.2022.3175810

Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech

Authors: Yeunju Choi, Youngmoon Jung, Youngjoo Suh, Hoirin Kim

Abstract: Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, there are cases where a TTS system generates low-quality speech, mainly caused by limited training data or information loss during knowledge distillation. Therefore, we propose a novel method to improve speech quality by training a TTS model under the supervision of perceptual loss, which measures the distance between the maximum possible speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech using the pre-trained MOS prediction model. The proposed method can be applied independently regardless of the TTS model architecture or the cause of speech quality degradation and efficiently without increasing the inference time or model complexity. The evaluation results for the MOS and phone error rate demonstrate that our proposed approach improves previous models in terms of both naturalness and intelligibility.

Submitted 25 May, 2022; v1 submitted 2 November, 2020; originally announced November 2020.

Comments: 9 pages, 5 figures, 4 tables

Journal ref: IEEE Access, vol. 10, pp. 52621 - 52629, 2022

arXiv:2010.16003 [pdf]

doi 10.32604/cmc.2020.012223

PIINET: A 360-degree Panoramic Image Inpainting Network Using a Cube Map

Authors: Seo Woo Han, Doug Young Suh

Abstract: Inpainting has been continuously studied in the field of computer vision. As artificial intelligence technology developed, deep learning technology was introduced in inpainting research, helping to improve performance. Currently, the input target of inpainting algorithms using deep learning has expanded from a single image to a video. However, deep learning-based inpainting technology for panoramic images has not been actively studied. We propose a 360-degree panoramic image inpainting method using generative adversarial networks (GANs). The proposed network takes a 360-degree panoramic image in equirectangular format as input, converts it into a cube map format, which has relatively little distortion, and uses it for training the network. Since the cube map format is used, the correlation of the six sides of the cube map should be considered. Therefore, all faces of the cube map are used as input for the whole discriminative network, and each face of the cube map is used as input for the slice discriminative network to determine the authenticity of the generated image. The proposed network performed qualitatively better than existing single-image inpainting algorithms and baseline algorithms.

Submitted 26 January, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

Journal ref: Vol.66, No.1, 2021, pp.213-228

arXiv:2006.06937 [pdf]

Non-parallel voice conversion based on source-to-target direct mapping

Authors: Sunghee Jung, Youngjoo Suh, Yeunju Choi, Hoirin Kim

Abstract: Recent work utilizing phonetic posteriorgrams (PPGs) for non-parallel voice conversion has significantly increased the usability of voice conversion, since the source and target DBs are no longer required to have matching contents. In this approach, the PPGs are used as the linguistic bridge between source and target speaker features. However, PPG-based non-parallel voice conversion has the limitation that it needs two cascading networks at conversion time, making it less suitable for real-time applications and vulnerable to source speaker intelligibility at the conversion stage. To address this limitation, we propose a new non-parallel voice conversion technique that employs a single neural network for direct source-to-target voice parameter mapping. With this single network structure, the proposed approach can reduce both the conversion time and the number of network parameters, which can be especially important factors in embedded or real-time environments. Additionally, it improves the quality of voice conversion by skipping the phone recognizer at the conversion stage, which effectively prevents the possible loss of phonetic information that the PPG-based indirect method suffers from. Experiments show that our approach reduces the number of network parameters and the conversion time by 41.9% and 44.5%, respectively, with improved voice similarity over the original PPG-based method.

Submitted 12 June, 2020; originally announced June 2020.

Comments: Submitted to Interspeech 2019

arXiv:1908.07335 [pdf, other]

doi 10.1109/iCCECE46942.2019.8941882

Learning-Driven Wireless Communications, towards 6G

Authors: Md. Jalil Piran, Doug Young Suh

Abstract: The fifth generation (5G) of wireless communication is in its infancy, and its evolving versions will be launched over the coming years. However, as the inherent constraints of 5G become apparent and applications and services with stringent requirements emerge (e.g., latency, energy/bit, traffic capacity, peak data rate, and reliability), telecom researchers are turning their attention to conceptualizing the next generation of wireless communications, i.e., 6G. In this paper, we investigate 6G challenges, requirements, and trends. Furthermore, we discuss how artificial intelligence (AI) techniques can contribute to 6G. Based on the requirements and solutions, we identify some fascinating new services and use cases of 6G that cannot be adequately supported by 5G. Moreover, we explain some research directions that lead to the successful conceptualization and implementation of 6G.

Submitted 1 August, 2019; originally announced August 2019.

Report number: 19276363

Journal ref: 2019 International Conference on Computing, Electronics & Communications Engineering (iCCECE)

Showing 1–5 of 5 results for author: Suh, Y