Showing 1–16 of 16 results for author: Sung, J S

Searching in archive cs.
  1. arXiv:2211.16307  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Controllable speech synthesis by learning discrete phoneme-level prosodic representations

    Authors: Nikolaos Ellinas, Myrsini Christidou, Alexandra Vioni, June Sig Sung, Aimilios Chalamandaris, Pirros Tsiakoulis, Paris Mastorocostas

    Abstract: In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autore…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: Final published version available at: Speech Communication. arXiv admin note: substantial text overlap with arXiv:2111.10168

  2. arXiv:2211.01327  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis

    Authors: Konstantinos Klapsas, Karolos Nikitaras, Nikolaos Ellinas, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: A large part of the expressive speech synthesis literature focuses on learning prosodic representations of the speech signal which are then modeled by a prior distribution during inference. In this paper, we compare different prior architectures at the task of predicting phoneme level prosodic representations extracted with an unsupervised FVAE model. We use both subjective and objective metrics t…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  3. arXiv:2211.00523  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis

    Authors: Karolos Nikitaras, Konstantinos Klapsas, Nikolaos Ellinas, Georgia Maniati, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules that operate in the correspond…

    Submitted 1 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  4. arXiv:2211.00342  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features

    Authors: Alexandra Vioni, Georgia Maniati, Nikolaos Ellinas, June Sig Sung, Inchul Hwang, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models. Such MOS prediction models include MOSNet and LDNet that use spectral features as input, and SSL-MOS that relies on a pretrained self-supervised learning model that directly uses the speech signal as input. In modern high-quality neural TTS systems, prosodic appropriateness with re…

    Submitted 7 May, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: Proceedings of ICASSP 2023

  5. arXiv:2210.17264

    cs.SD cs.CL cs.LG eess.AS

    Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

    Authors: Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Georgia Maniati, Panos Kakoulidis, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper presents a method for end-to-end cross-lingual text-to-speech (TTS) which aims to preserve the target language's pronunciation regardless of the original speaker's language. The model used is based on a non-attentive Tacotron architecture, where the decoder has been replaced with a normalizing flow network conditioned on the speaker identity, allowing both TTS and voice conversion (VC)…

    Submitted 27 February, 2024; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: Fundamental changes to the model described and experimental procedure

  6. arXiv:2204.05070  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Fine-grained Noise Control for Multispeaker Speech Synthesis

    Authors: Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations. Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors, i.e. linguistic content, prosody and timbre from any residual factors, such as recording conditions and background noise. This paper pr…

    Submitted 27 October, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  7. Karaoker: Alignment-free singing voice synthesis with speech training data

    Authors: Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris

    Abstract: Existing singing voice synthesis models (SVS) are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthes…

    Submitted 31 August, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  8. arXiv:2204.03421  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Self-supervised learning for robust voice cloning

    Authors: Konstantinos Klapsas, Nikolaos Ellinas, Karolos Nikitaras, Georgios Vamvoukakis, Panos Kakoulidis, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high quality speech representations when specific audio augmentations are…

    Submitted 2 November, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  9. arXiv:2204.03040  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

    Authors: Georgia Maniati, Alexandra Vioni, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: In this work, we present the SOMOS dataset, the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances of the LJ Speech voice, a publ…

    Submitted 24 August, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  10. arXiv:2203.14416  [pdf, other]

    eess.AS cs.LG cs.SD

    Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge

    Authors: Sangjun Park, Kihyun Choo, Joohyung Lee, Anton V. Porov, Konstantin Osipov, June Sig Sung

    Abstract: Text-to-Speech (TTS) services that run on edge devices have many advantages compared to cloud TTS, e.g., latency and privacy issues. However, neural vocoders with a low complexity and small model footprint inevitably generate annoying sounds. This study proposes a Bunched LPCNet2, an improved LPCNet architecture that provides highly efficient performance in high-quality for cloud servers and in a…

    Submitted 30 June, 2022; v1 submitted 27 March, 2022; originally announced March 2022.

    Comments: Interspeech 2022

  11. arXiv:2111.10177  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis

    Authors: Alexandra Vioni, Myrsini Christidou, Nikolaos Ellinas, Georgios Vamvoukakis, Panos Kakoulidis, Taehoon Kim, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering…

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Proceedings of ICASSP 2021

  12. arXiv:2111.10173  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Word-Level Style Control for Expressive, Non-attentive Speech Synthesis

    Authors: Konstantinos Klapsas, Nikolaos Ellinas, June Sig Sung, Hyoungmin Park, Spyros Raptis

    Abstract: This paper presents an expressive speech synthesis architecture for modeling and controlling the speaking style at a word level. It attempts to learn word-level stylistic and prosodic representations of the speech data, with the aid of two encoders. The first one models style by finding a combination of style tokens for each word given the acoustic features, and the second outputs a word-level seq…

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Proceedings of SPECOM 2021

  13. arXiv:2111.10168  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control

    Authors: Myrsini Christidou, Alexandra Vioni, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Panos Kakoulidis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup, which is based on prosodic clustering. An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder. Several improvements over the basic single-speaker method are proposed that increase the prosodic control ra…

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Proceedings of SPECOM 2021

  14. arXiv:2111.09146  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

    Authors: Konstantinos Markopoulos, Nikolaos Ellinas, Alexandra Vioni, Myrsini Christidou, Panos Kakoulidis, Georgios Vamvoukakis, Georgia Maniati, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis, Aimilios Chalamandaris

    Abstract: In this paper, a text-to-rapping/singing system is introduced, which can be adapted to any speaker's voice. It utilizes a Tacotron-based multispeaker acoustic model trained on read-only speech data and which provides prosody control at the phoneme level. Dataset augmentation and additional prosody manipulation based on traditional DSP algorithms are also investigated. The neural TTS model is fine-…

    Submitted 17 November, 2021; originally announced November 2021.

    Comments: Proceedings of 11th ISCA Speech Synthesis Workshop (SSW 11)

  15. arXiv:2111.09075  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    Cross-lingual Low Resource Speaker Adaptation Using Phonological Features

    Authors: Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: The idea of using phonological features instead of phonemes as input to sequence-to-sequence TTS has been recently proposed for zero-shot multilingual speech synthesis. This approach is useful for code-switching, as it facilitates the seamless uttering of foreign text embedded in a stream of native text. In our work, we train a language-agnostic multispeaker model conditioned on a set of phonologi…

    Submitted 17 November, 2021; originally announced November 2021.

    Comments: Proceedings of INTERSPEECH 2021

  16. arXiv:2111.09052  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

    Authors: Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis, Spyros Raptis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis

    Abstract: This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture that adopts modules from both the Tacotron 1 and 2 models is proposed, while stability is ensured by usin…

    Submitted 17 November, 2021; originally announced November 2021.

    Comments: Proceedings of INTERSPEECH 2020