arXiv:2406.05965 [pdf, other]

MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance

Authors: Semin Kim, Myeonghun Jeong, Hyeonseung Lee, Minchan Kim, Byoung Jin Choi, Nam Soo Kim

Abstract: In this paper, we propose MakeSinger, a semi-supervised training method for singing voice synthesis (SVS) via classifier-free diffusion guidance. The challenge in SVS lies in the costly process of gathering aligned sets of text, pitch, and audio data. MakeSinger enables the training of the diffusion-based SVS model from any speech and singing voice data regardless of its labeling, thereby enhancing the quality of generated voices with a large amount of unlabeled data. At inference, our novel dual guiding mechanism gives text and pitch guidance on the reverse diffusion step by estimating the score of masked input. Experimental results show that the model trained in a semi-supervised manner outperforms other baselines trained only on the labeled data in terms of pronunciation, pitch accuracy, and overall quality. Furthermore, we demonstrate that by adding Text-to-Speech (TTS) data in training, the model can synthesize the singing voices of TTS speakers even without their singing voices.

Submitted 9 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

arXiv:2401.01498 [pdf, other]

Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction

Authors: Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Semin Kim, Joun Yeop Lee, Nam Soo Kim

Abstract: We propose a novel text-to-speech (TTS) framework centered around a neural transducer. Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages, utilizing discrete semantic tokens obtained from wav2vec2.0 embeddings. For robust and efficient alignment modeling, we employ a neural transducer named token transducer for the semantic token prediction, benefiting from its hard monotonic alignment constraints. Subsequently, a non-autoregressive (NAR) speech generator efficiently synthesizes waveforms from these semantic tokens. Additionally, a reference speech controls temporal dynamics and acoustic conditions at each stage. This decoupled framework reduces the training complexity of TTS while allowing each stage to focus on semantic and acoustic modeling. Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity, both objectively and subjectively. We also delve into the inference speed and prosody control capabilities of our approach, highlighting the potential of neural transducers in TTS frameworks.

Submitted 2 January, 2024; originally announced January 2024.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2311.02898 [pdf, other]

Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction

Authors: Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Dongjune Lee, Nam Soo Kim

Abstract: We introduce a text-to-speech (TTS) framework based on a neural transducer. We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework enjoying its monotonic alignment constraints. The proposed model first generates aligned semantic tokens using the neural transducer, then synthesizes a speech sample from the semantic tokens using a non-autoregressive (NAR) speech generator. This decoupled framework alleviates the training complexity of TTS and allows each stage to focus on 1) linguistic and alignment modeling and 2) fine-grained acoustic modeling, respectively. Experimental results on the zero-shot adaptive TTS show that the proposed model exceeds the baselines in speech quality and speaker similarity via objective and subjective measures. We also investigate the inference speed and prosody controllability of our proposed model, showing the potential of the neural transducer for TTS frameworks.

Submitted 8 November, 2023; v1 submitted 6 November, 2023; originally announced November 2023.

Comments: Accepted at ASRU2023

arXiv:2305.15526 [pdf, other]

Radiomap Inpainting for Restricted Areas based on Propagation Priority and Depth Map

Authors: Songyang Zhang, Tianhang Yu, Brian Choi, Feng Ouyang, Zhi Ding

Abstract: Providing rich and useful information regarding spectrum activities and propagation channels, radiomaps characterize the detailed distribution of power spectral density (PSD) and are important tools for network planning in modern wireless systems. Generally, radiomaps are constructed from radio strength measurements by deployed sensors and user devices. However, not all areas are accessible for radio measurements due to physical constraints and security considerations, leading to non-uniformly spaced measurements and blanks on a radiomap. In this work, we explore the distribution of radio spectrum strengths in view of surrounding environments, and propose two radiomap inpainting approaches for the reconstruction of radiomaps that cover missing areas. Specifically, we first define a propagation-based priority and integrate exemplar-based inpainting with a radio propagation model for fine-resolution small-size missing area reconstruction on a radiomap. Then, we introduce a novel radio depth map and propose a two-step template-perturbation approach for large-size restricted region inpainting. Our experimental results demonstrate the power of the proposed propagation priority and radio depth map in capturing the PSD distribution, as well as the efficacy of the proposed methods for radiomap reconstruction.

Submitted 24 May, 2023; originally announced May 2023.

Comments: submitted to IEEE journal for possible publication

arXiv:2211.16866 [pdf, other]

doi 10.1109/LSP.2022.3226655

SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech

Authors: Byoung Jin Choi, Myeonghun Jeong, Joun Yeop Lee, Nam Soo Kim

Abstract: Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voice characteristics of an unseen speaker. The main challenge of ZSM-TTS is to increase the overall speaker similarity for unseen speakers. One of the most successful speaker conditioning methods for flow-based multi-speaker text-to-speech (TTS) models is to utilize functions which predict the scale and bias parameters of the affine coupling layers according to the given speaker embedding vector. In this letter, we improve on the previous speaker conditioning method by introducing a speaker-normalized affine coupling (SNAC) layer which allows for unseen speaker speech synthesis in a zero-shot manner, leveraging a normalization-based conditioning technique. The newly designed coupling layer explicitly normalizes the input by the parameters predicted from a speaker embedding vector while training, enabling an inverse process of denormalizing for a new speaker embedding at inference. The proposed conditioning scheme yields state-of-the-art performance in terms of speech quality and speaker similarity in a ZSM-TTS setting.

Submitted 30 November, 2022; originally announced November 2022.

Comments: Accepted to IEEE Signal Processing Letters

arXiv:2210.05979 [pdf, other]

Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech

Authors: Byoung Jin Choi, Myeonghun Jeong, Minchan Kim, Sung Hwan Mun, Nam Soo Kim

Abstract: Several recently proposed text-to-speech (TTS) models have succeeded in generating speech samples with human-level quality in the single-speaker and multi-speaker TTS scenarios with a set of pre-defined speakers. However, synthesizing a new speaker's voice with a single reference audio, commonly known as zero-shot multi-speaker text-to-speech (ZSM-TTS), is still a very challenging task. The main challenge of ZSM-TTS is the speaker domain shift problem upon the speech generation of a new speaker. To mitigate this problem, we propose adversarial speaker-consistency learning (ASCL). The proposed method first generates an additional speech sample of a query speaker using the external untranscribed datasets at each training iteration. Then, the model learns to consistently generate the speech sample of the same speaker as the corresponding speaker embedding vector by employing an adversarial learning scheme. The experimental results show that the proposed method is effective compared to the baseline in terms of quality and speaker similarity in ZSM-TTS.

Submitted 22 November, 2022; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: APSIPA 2022

arXiv:2209.04566 [pdf, other]

Exemplar-Based Radio Map Reconstruction of Missing Areas Using Propagation Priority

Authors: Songyang Zhang, Tianhang Yu, Jonathan Tivald, Brian Choi, Feng Ouyang, Zhi Ding

Abstract: A radio map describes network coverage and is a practically important tool for network planning in modern wireless systems. Generally, radio strength measurements are collected to construct fine-resolution radio maps for analysis. However, certain protected areas are not accessible for measurement due to physical constraints and security considerations, leading to blanked spaces on a radio map. Non-uniformly spaced measurements and uneven observation resolution make radio map estimation and spectrum planning in protected areas more difficult. This work explores the distribution of radio spectrum strengths and proposes an exemplar-based approach to reconstruct missing areas on a radio map. Instead of taking generic image processing approaches, we leverage radio propagation models to determine directions of region filling and develop two different schemes to estimate the missing radio signal power. Our test results based on high-fidelity simulation demonstrate the efficacy of the proposed methods for radio map reconstruction.

Submitted 9 September, 2022; originally announced September 2022.

Comments: To appear in 2022 IEEE Global Communications Conference (Globecom)

arXiv:2203.15447 [pdf, other]

doi 10.21437/Interspeech.2022-225

Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus

Authors: Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Sunghwan Ahn, Joun Yeop Lee, Nam Soo Kim

Abstract: Training a text-to-speech (TTS) model requires a large-scale text-labeled speech corpus, which is troublesome to collect. In this paper, we propose a transfer learning framework for TTS that utilizes a large unlabeled speech dataset for pre-training. By leveraging the wav2vec2.0 representation, unlabeled speech can greatly improve performance, especially when labeled speech is scarce. We also extend the proposed method to zero-shot multi-speaker TTS (ZS-TTS). The experimental results verify the effectiveness of the proposed method in terms of naturalness, intelligibility, and speaker generalization. We highlight that, by pre-training on an unlabeled multi-speaker speech corpus, the single-speaker TTS model fine-tuned on only 10 minutes of labeled data outperforms the other baselines, and the ZS-TTS model fine-tuned on only 30 minutes of single-speaker data can generate the voice of an arbitrary speaker.

Submitted 6 October, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: Accepted by Interspeech2022

arXiv:2104.01409 [pdf, other]

Diff-TTS: A Denoising Diffusion Model for Text-to-Speech

Authors: Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, Nam Soo Kim

Abstract: Although neural text-to-speech (TTS) models have attracted a lot of attention and succeeded in generating human-like speech, there is still room for improvement in their naturalness and architectural efficiency. In this work, we propose a novel non-autoregressive TTS model, namely Diff-TTS, which achieves highly natural and efficient speech synthesis. Given the text, Diff-TTS exploits a denoising diffusion framework to transform the noise signal into a mel-spectrogram via diffusion time steps. In order to learn the mel-spectrogram distribution conditioned on the text, we present a likelihood-based optimization method for TTS. Furthermore, to boost the inference speed, we leverage the accelerated sampling method that allows Diff-TTS to generate raw waveforms much faster without significantly degrading perceptual quality. Through experiments, we verified that Diff-TTS generates speech 28 times faster than real time with a single NVIDIA 2080Ti GPU.

Submitted 3 April, 2021; originally announced April 2021.

Comments: Submitted to INTERSPEECH 2021

arXiv:2104.00436 [pdf, other]

doi 10.21437/Interspeech.2021-465

Expressive Text-to-Speech using Style Tag

Authors: Minchan Kim, Sung Jun Cheon, Byoung Jin Choi, Jong Jin Kim, Nam Soo Kim

Abstract: As recent text-to-speech (TTS) systems have rapidly improved in speech quality and generation speed, many researchers now focus on a more challenging issue: expressive TTS. To control speaking styles, existing expressive TTS models use a categorical style index or reference speech as style input. In this work, we propose StyleTagging-TTS (ST-TTS), a novel expressive TTS model that utilizes a style tag written in natural language. Using a style-tagged TTS dataset and a pre-trained language model, we modeled the relationship between linguistic embedding and speaking style domain, which enables our model to work even with style tags unseen during training. As the style tag is written in natural language, it can control speaking style in a more intuitive, interpretable, and scalable way compared with a style index or reference speech. In addition, in terms of model architecture, we propose an efficient non-autoregressive (NAR) TTS architecture with single-stage training. The experimental results show that ST-TTS outperforms the existing expressive TTS model, Tacotron2-GST, in speech quality and expressiveness.

Submitted 6 October, 2022; v1 submitted 1 April, 2021; originally announced April 2021.

Comments: Accepted by Interspeech 2021

arXiv:2012.00179 [pdf, other]

Crowd-Sourced Road Quality Mapping in the Developing World

Authors: Benjamin Choi, John Kamalu

Abstract: Road networks are among the most essential components of a country's infrastructure. By facilitating the movement and exchange of goods, people, and ideas, they support economic and cultural activity both within and across borders. Up-to-date mapping of the geographical distribution of roads and their quality is essential in high-impact applications ranging from land use planning to wilderness conservation. Mapping presents a particularly pressing challenge in developing countries, where documentation is poor and disproportionate amounts of road construction are expected to occur in the coming decades. We present a new crowd-sourced approach capable of assessing road quality and identify key challenges and opportunities in the transferability of deep learning based methods across domains.

Submitted 30 November, 2020; originally announced December 2020.

Comments: Presented at NeurIPS 2020 Workshop on Machine Learning for the Developing World

arXiv:2006.04598 [pdf, other]

WaveNODE: A Continuous Normalizing Flow for Speech Synthesis

Authors: Hyeongju Kim, Hyeonseung Lee, Woo Hyun Kang, Sung Jun Cheon, Byoung Jin Choi, Nam Soo Kim

Abstract: In recent years, various flow-based generative models have been proposed to generate high-fidelity waveforms in real-time. However, these models require either a well-trained teacher network or a number of flow steps making them memory-inefficient. In this paper, we propose a novel generative model called WaveNODE which exploits a continuous normalizing flow for speech synthesis. Unlike the conventional models, WaveNODE places no constraint on the function used for flow operation, thus allowing the usage of more flexible and complex functions. Moreover, WaveNODE can be optimized to maximize the likelihood without requiring any teacher network or auxiliary loss terms. We experimentally show that WaveNODE achieves comparable performance with fewer parameters compared to the conventional flow-based vocoders.

Submitted 2 July, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: 8 pages, 4 figures, Second workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models (ICML 2020)

arXiv:2005.10985 [pdf]

Apply VGGNet-based deep learning model of vibration data for prediction model of gravity acceleration equipment

Authors: SeonWoo Lee, HyeonTak Yu, HoJun Yang, JaeHeung Yang, GangMin Lim, KyuSung Kim, ByeongKeun Choi, JangWoo Kwon

Abstract: Hypergravity accelerators are a type of large machinery used for gravity training or medical research. A failure of such large equipment can be a serious problem in terms of safety or costs. This paper proposes a prediction model that can proactively prevent failures that may occur in a hypergravity accelerator. The proposed method converts vibration signals to spectrograms and performs classification training using a deep learning model. An experiment was conducted to evaluate the performance of the proposed method. A 4-channel accelerometer was attached to the bearing housing, which is a rotor, and time-amplitude data were obtained from the measured values by sampling. The data were converted to a two-dimensional spectrogram, and classification training was performed using a deep learning model for four conditions of the equipment: Unbalance, Misalignment, Shaft Rubbing, and Normal. The experimental results showed that the proposed method had a 99.5% F1-Score, which was up to 23 percentage points higher than the 76.25% for existing feature-based learning models.

Submitted 18 August, 2020; v1 submitted 21 May, 2020; originally announced May 2020.

Comments: 15 pages, 10 figures. Associated publication: Journal of Mechanics in Medicine and Biology, https://www.worldscientific.com/worldscinet/jmmb

arXiv:1908.05133 [pdf]

Assessing Workers' Perceived Risk During Construction Task Using A Wristband-Type Biosensor

Authors: Byungjoo Choi, Gaang Lee, Houtan Jebelli, SangHyun Lee

Abstract: The construction industry has demonstrated a high frequency and severity of accidents. Construction accidents are the result of the interaction between unsafe work conditions and workers' unsafe behaviors. Given this relation, perceived risk is determined by an individual's response to a potential work hazard during the work. As such, risk perception is critical to understanding workers' unsafe behaviors. Established methods of assessing workers' perceived risk have mainly relied on surveys and interviews. However, these post-hoc methods, which are limited in monitoring dynamic changes in risk perception and require conducting surveys at a construction site, may prove cumbersome to workers. Additionally, these methods frequently suffer from self-report bias. To overcome the limitations of previous subjective measures, this study aims to develop a framework for the objective and continuous prediction of construction workers' perceived risk using physiological signals [e.g., electrodermal activity (EDA)] acquired from workers' wristband-type biosensors. To achieve this objective, physiological signals were collected from eight construction workers while they performed regular tasks in the field. Various filtering methods were applied to exclude noise recorded in the signal and to extract various features of the signals as workers experienced different risk levels. Then, a supervised machine-learning model was trained to explore the applicability of the collected physiological signals for the prediction of risk perception. The results showed that features based on EDA data collected from wristbands are feasible and useful to the process of continuously monitoring workers' perceived risk during ongoing work. This study contributes to an in-depth understanding of construction workers' perceived risk by developing a noninvasive means of continuously monitoring workers' perceived risk.

Submitted 14 August, 2019; originally announced August 2019.

Journal ref: Proceedings of the Creative Construction Conference (CCC 2019)

arXiv:1904.09302 [pdf, other]

Model Predictive Control Framework for Improving Vehicle Cornering Performance Using Handling Characteristics

Authors: Kyoungseok Han, Giseo Park, Gokul S. Sankar, Kanghyun Nam, Seibum B. Choi

Abstract: This paper proposes a new control strategy to improve vehicle cornering performance in a model predictive control framework. The most distinguishing feature of the proposed method is that the natural handling characteristics of the production vehicle are exploited to reduce the complexity of the conventional control methods. For safety's sake, most production vehicles are built to exhibit understeer handling characteristics to some extent. By monitoring how much the vehicle is biased into the understeer state, the controller attempts to adjust this amount in a way that improves the vehicle cornering performance. With this particular strategy, an innovative controller can be designed without road friction information, which complicates the conventional control methods. In addition, unlike the conventional controllers, the reference yaw rate that is highly dependent on road friction need not be defined due to the proposed control structure. The optimal control problem is formulated in a model predictive control framework to handle the constraints efficiently, and simulations in various test scenarios illustrate the effectiveness of the proposed approach.

Submitted 14 November, 2019; v1 submitted 19 April, 2019; originally announced April 2019.

Showing 1–15 of 15 results for author: Choi, B