Showing 1–22 of 22 results for author: Song, E

Searching in archive eess.
  1. arXiv:2402.05706  [pdf, other]

    cs.CL cs.SD eess.AS

    Unified Speech-Text Pretraining for Spoken Dialog Modeling

    Authors: Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Sungroh Yoon, Kang Min Yoo

    Abstract: While recent work shows promising results in expanding the capabilities of large language models (LLM) to directly understand and synthesize speech, an LLM-based strategy for modeling spoken dialogs remains elusive and calls for further investigation. This work proposes an extensive speech-text LLM framework, named the Unified Spoken Dialog Model (USDM), to generate coherent spoken responses with…

    Submitted 8 February, 2024; originally announced February 2024.

  2. Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech

    Authors: Hyungchan Yoon, Changhwan Kim, Eunwoo Song, Hyun-Wook Yoon, Hong-Goo Kang

    Abstract: For personalized speech generation, a neural text-to-speech (TTS) model must be successfully implemented with limited data from a target speaker. To this end, the baseline TTS model needs to be amply generalized to out-of-domain data (i.e., target speaker's speech). However, approaches to address this out-of-domain generalization problem in TTS have yet to be thoroughly studied. In this work, we p…

    Submitted 28 August, 2023; originally announced August 2023.

    Comments: INTERSPEECH 2023

    Journal ref: Proc. INTERSPEECH 2023, 4299-4303

  3. arXiv:2305.12805  [pdf, ps, other]

    eess.SP

    Decentralized Equalization for Massive MIMO Systems With Colored Noise Samples

    Authors: Xiaotong Zhao, Mian Li, Bo Wang, Enbin Song, Tsung-Hui Chang, Qingjiang Shi

    Abstract: Recently, the decentralized baseband processing (DBP) paradigm and relevant detection methods have been proposed to enable extremely large-scale massive multiple-input multiple-output technology. Under the DBP architecture, base station antennas are divided into several independent clusters, each connected to a local computing fabric. However, current detection methods tailored to DBP only conside…

    Submitted 22 May, 2023; originally announced May 2023.

  4. arXiv:2211.14986  [pdf]

    eess.IV cs.CV

    An Unpaired Cross-modality Segmentation Framework Using Data Augmentation and Hybrid Convolutional Networks for Segmenting Vestibular Schwannoma and Cochlea

    Authors: Yuzhou Zhuang, Hong Liu, Enmin Song, Coskun Cetinkaya, Chih-Cheng Hung

    Abstract: The crossMoDA challenge aims to automatically segment the vestibular schwannoma (VS) tumor and cochlea regions of unlabeled high-resolution T2 scans by leveraging labeled contrast-enhanced T1 scans. The 2022 edition extends the segmentation task by including multi-institutional scans. In this work, we proposed an unpaired cross-modality segmentation framework using data augmentation and hybrid con…

    Submitted 27 November, 2022; originally announced November 2022.

    Comments: Accepted by BrainLes MICCAI proceedings

  5. arXiv:2210.15964  [pdf, other]

    eess.AS cs.LG cs.SD

    Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis

    Authors: Yuma Shirahata, Ryuichi Yamamoto, Eunwoo Song, Ryo Terashima, Jae-Min Kim, Kentaro Tachibana

    Abstract: Several fully end-to-end text-to-speech (TTS) models have been proposed that have shown better performance compared to cascade models (i.e., training acoustic and vocoder models separately). However, they often generate unstable pitch contour with audible artifacts when the dataset contains emotional attributes, i.e., large diversity of pronunciation and prosody. To address this problem, we propos…

    Submitted 21 February, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: ICASSP 2023

  6. arXiv:2210.15917  [pdf, other]

    cs.IT eess.SP

    Low-Complexity Channel Estimation for Massive MIMO Systems with Decentralized Baseband Processing

    Authors: Yanqing Xu, Bo Wang, Enbin Song, Qingjiang Shi, Tsung-Hui Chang

    Abstract: The traditional centralized baseband processing architecture is faced with the bottlenecks of high computation complexity and excessive fronthaul communication, especially when the number of antennas at the base station (BS) is large. To cope with these two challenges, the decentralized baseband processing (DBP) architecture has been proposed, where the BS antennas are partitioned into multiple cl…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: Submitted for publication

  7. arXiv:2206.15067  [pdf, other]

    cs.SD eess.AS

    Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems

    Authors: Hyun-Wook Yoon, Ohsung Kwon, Hoyeon Lee, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim, Min-Jae Hwang

    Abstract: This paper proposes an effective emotional text-to-speech (TTS) system with a pre-trained language model (LM)-based emotion prediction method. Unlike conventional systems that require auxiliary inputs such as manually defined emotion classes, our system directly estimates emotion-related attributes from the input text. Specifically, we utilize generative pre-trained transformer (GPT)-3 to jointly…

    Submitted 30 June, 2022; v1 submitted 30 June, 2022; originally announced June 2022.

    Comments: Accepted by INTERSPEECH2022

  8. arXiv:2206.14984  [pdf, other]

    eess.AS cs.SD

    TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder

    Authors: Eunwoo Song, Ryuichi Yamamoto, Ohsung Kwon, Chan-Ho Song, Min-Jae Hwang, Suhyeon Oh, Hyun-Wook Yoon, Jin-Seob Kim, Jae-Min Kim

    Abstract: Recent advances in synthetic speech quality have enabled us to train text-to-speech (TTS) systems by using synthetic corpora. However, merely increasing the amount of synthetic data is not always advantageous for improving training efficiency. Our aim in this study is to selectively choose synthetic data that are beneficial to the training process. In the proposed method, we first adopt a variatio…

    Submitted 29 June, 2022; originally announced June 2022.

    Comments: Accepted to the conference of INTERSPEECH 2022

  9. arXiv:2204.10020  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation

    Authors: Ryo Terashima, Ryuichi Yamamoto, Eunwoo Song, Yuma Shirahata, Hyun-Wook Yoon, Jae-Min Kim, Kentaro Tachibana

    Abstract: Data augmentation via voice conversion (VC) has been successfully applied to low-resource expressive text-to-speech (TTS) when only neutral data for the target speaker are available. Although the quality of VC is crucial for this approach, it is challenging to learn a stable VC model because the amount of data is limited in low-resource scenarios, and highly expressive speech has large acoustic va…

    Submitted 5 July, 2022; v1 submitted 21 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  10. arXiv:2101.07412  [pdf, other]

    eess.AS cs.SD

    Improved parallel WaveGAN vocoder with perceptually weighted spectrogram loss

    Authors: Eunwoo Song, Ryuichi Yamamoto, Min-Jae Hwang, Jin-Seob Kim, Ohsung Kwon, Jae-Min Kim

    Abstract: This paper proposes a spectral-domain perceptual weighting technique for Parallel WaveGAN-based text-to-speech (TTS) systems. The recently proposed Parallel WaveGAN vocoder successfully generates waveform sequences using a fast non-autoregressive WaveNet model. By employing multi-resolution short-time Fourier transform (MR-STFT) criteria with a generative adversarial network, the light-weight conv…

    Submitted 18 January, 2021; originally announced January 2021.

    Comments: To appear in SLT 2021

  11. arXiv:2010.14151  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators

    Authors: Ryuichi Yamamoto, Eunwoo Song, Min-Jae Hwang, Jae-Min Kim

    Abstract: This paper proposes voicing-aware conditional discriminators for Parallel WaveGAN-based waveform synthesis systems. In this framework, we adopt a projection-based conditioning method that can significantly improve the discriminator's performance. Furthermore, the conventional discriminator is separated into two waveform discriminators for modeling voiced and unvoiced speech. As each discriminator…

    Submitted 26 April, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: Accepted to the conference of ICASSP 2021

  12. arXiv:2010.13421  [pdf, other]

    eess.AS cs.SD

    TTS-by-TTS: TTS-driven Data Augmentation for Fast and High-Quality Speech Synthesis

    Authors: Min-Jae Hwang, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim

    Abstract: In this paper, we propose a text-to-speech (TTS)-driven data augmentation method for improving the quality of a non-autoregressive (AR) TTS system. Recently proposed non-AR models, such as FastSpeech 2, have successfully achieved fast speech synthesis system. However, their quality is not satisfactory, especially when the amount of training data is insufficient. To address this problem, we propose…

    Submitted 26 October, 2020; originally announced October 2020.

    Comments: Submitted to ICASSP 2021

  13. arXiv:2008.00132  [pdf, other]

    eess.AS

    Neural text-to-speech with a modeling-by-generation excitation vocoder

    Authors: Eunwoo Song, Min-Jae Hwang, Ryuichi Yamamoto, Jin-Seob Kim, Ohsung Kwon, Jae-Min Kim

    Abstract: This paper proposes a modeling-by-generation (MbG) excitation vocoder for a neural text-to-speech (TTS) system. Recently proposed neural excitation vocoders can realize qualified waveform generation by combining a vocal tract filter with a WaveNet-based glottal excitation generator. However, when these vocoders are used in a TTS system, the quality of synthesized speech is often degraded owing to…

    Submitted 31 July, 2020; originally announced August 2020.

    Comments: Accepted to the conference of INTERSPEECH 2020

  14. arXiv:2001.11686  [pdf, other]

    eess.AS

    Improving LPCNet-based Text-to-Speech with Linear Prediction-structured Mixture Density Network

    Authors: Min-Jae Hwang, Eunwoo Song, Ryuichi Yamamoto, Frank Soong, Hong-Goo Kang

    Abstract: In this paper, we propose an improved LPCNet vocoder using a linear prediction (LP)-structured mixture density network (MDN). The recently proposed LPCNet vocoder has successfully achieved high-quality and lightweight speech synthesis systems by combining a vocal tract LP filter with a WaveRNN-based vocal source (i.e., excitation) generator. However, the quality of synthesized speech is often unst…

    Submitted 31 January, 2020; originally announced January 2020.

    Comments: Accepted to ICASSP 2020

    Journal ref: IEEE ICASSP 2020

  15. arXiv:1910.11480  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

    Authors: Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim

    Abstract: We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method…

    Submitted 6 February, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

    Comments: Accepted to the conference of ICASSP 2020

  16. arXiv:1905.08486  [pdf, other]

    eess.AS cs.LG cs.SD

    Effective parameter estimation methods for an ExcitNet model in generative text-to-speech systems

    Authors: Ohsung Kwon, Eunwoo Song, Jae-Min Kim, Hong-Goo Kang

    Abstract: In this paper, we propose a high-quality generative text-to-speech (TTS) system using an effective spectrum and excitation estimation method. Our previous research verified the effectiveness of the ExcitNet-based speech generation model in a parametric TTS framework. However, the challenge remains to build a high-quality speech synthesis system because auxiliary conditional features estimated by a…

    Submitted 21 May, 2019; originally announced May 2019.

    Comments: 5 pages, 3 figures, 3 tables, submitted to Speech Synthesis Workshop 2019

  17. arXiv:1905.08413  [pdf]

    cs.CV eess.IV

    Dual-branch residual network for lung nodule segmentation

    Authors: Haichao Cao, Hong Liu, Enmin Song, Chih-Cheng Hung, Guangzhi Ma, Xiangyang Xu, Renchao Jin, Jianguo Lu

    Abstract: An accurate segmentation of lung nodules in computed tomography (CT) images is critical to lung cancer analysis and diagnosis. However, due to the variety of lung nodules and the similarity of visual characteristics between nodules and their surroundings, a robust segmentation of nodules becomes a challenging problem. In this study, we propose the Dual-branch Residual Network (DB-ResNet) which is…

    Submitted 20 May, 2019; originally announced May 2019.

    Comments: 24 pages, 6 figures

  18. arXiv:1905.03445  [pdf]

    cs.CV eess.IV

    Two-Stage Convolutional Neural Network Architecture for Lung Nodule Detection

    Authors: Haichao Cao, Hong Liu, Enmin Song, Guangzhi Ma, Xiangyang Xu, Renchao Jin, Tengying Liu, Chih-Cheng Hung

    Abstract: Early detection of lung cancer is an effective way to improve the survival rate of patients. It is a critical step to have accurate detection of lung nodules in computed tomography (CT) images for the diagnosis of lung cancer. However, due to the heterogeneity of the lung nodules and the complexity of the surrounding environment, robust nodule detection has been a challenging task. In this study,…

    Submitted 9 May, 2019; originally announced May 2019.

    Comments: 29 pages, 10 figures

  19. arXiv:1904.04472  [pdf, other]

    eess.AS cs.SD

    Probability density distillation with generative adversarial networks for high-quality parallel waveform generation

    Authors: Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim

    Abstract: This paper proposes an effective probability density distillation (PDD) algorithm for WaveNet-based parallel waveform generation (PWG) systems. Recently proposed teacher-student frameworks in the PWG system have successfully achieved a real-time generation of speech signals. However, the difficulties optimizing the PDD criteria without auxiliary losses result in quality degradation of synthesized…

    Submitted 27 August, 2019; v1 submitted 9 April, 2019; originally announced April 2019.

    Comments: Accepted to the conference of INTERSPEECH 2019

  20. arXiv:1811.11913  [pdf, other]

    eess.AS cs.SD

    LP-WaveNet: Linear Prediction-based WaveNet Speech Synthesis

    Authors: Min-Jae Hwang, Frank Soong, Eunwoo Song, Xi Wang, Hyeonjoo Kang, Hong-Goo Kang

    Abstract: We propose a linear prediction (LP)-based waveform generation method via WaveNet vocoding framework. A WaveNet-based neural vocoder has significantly improved the quality of parametric text-to-speech (TTS) systems. However, it is challenging to effectively train the neural vocoder when the target database contains massive amount of acoustical information such as prosody, style or expressiveness. A…

    Submitted 4 March, 2020; v1 submitted 28 November, 2018; originally announced November 2018.

    Comments: Submitted to EUSIPCO 2020

  21. arXiv:1811.04769  [pdf, other]

    eess.AS cs.LG cs.SD

    ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems

    Authors: Eunwoo Song, Kyungguen Byun, Hong-Goo Kang

    Abstract: This paper proposes a WaveNet-based neural excitation model (ExcitNet) for statistical parametric speech synthesis systems. Conventional WaveNet-based neural vocoding systems significantly improve the perceptual quality of synthesized speech by statistically generating a time sequence of speech waveforms through an auto-regressive framework. However, they often suffer from noisy outputs because of…

    Submitted 21 August, 2019; v1 submitted 9 November, 2018; originally announced November 2018.

    Comments: Accepted to the conference of EUSIPCO 2019. arXiv admin note: text overlap with arXiv:1811.03311

  22. arXiv:1811.03311  [pdf, other]

    eess.AS cs.LG cs.SD

    Speaker-adaptive neural vocoders for parametric speech synthesis systems

    Authors: Eunwoo Song, Jin-Seob Kim, Kyungguen Byun, Hong-Goo Kang

    Abstract: This paper proposes speaker-adaptive neural vocoders for parametric text-to-speech (TTS) systems. Recently proposed WaveNet-based neural vocoding systems successfully generate a time sequence of speech signal with an autoregressive framework. However, it remains a challenge to synthesize high-quality speech when the amount of a target speaker's training data is insufficient. To generate more natur…

    Submitted 1 August, 2020; v1 submitted 8 November, 2018; originally announced November 2018.

    Comments: Accepted to the IEEE Workshop of MMSP 2020