Showing 1–26 of 26 results for author: Sung, J

Searching in archive eess.
  1. arXiv:2406.09696  [pdf, other]

    eess.IV cs.CV

    MoME: Mixture of Multimodal Experts for Cancer Survival Prediction

    Authors: Conghao Xiong, Hao Chen, Hao Zheng, Dong Wei, Yefeng Zheng, Joseph J. Y. Sung, Irwin King

    Abstract: Survival analysis, as a challenging task, requires integrating Whole Slide Images (WSIs) and genomic data for comprehensive decision-making. There are two main challenges in this task: significant heterogeneity and complex inter- and intra-modal interactions between the two modalities. Previous approaches utilize co-attention methods, which fuse features from both modalities only once after separa…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 8 + 1/2 pages, early accepted to MICCAI2024

  2. arXiv:2211.16307  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Controllable speech synthesis by learning discrete phoneme-level prosodic representations

    Authors: Nikolaos Ellinas, Myrsini Christidou, Alexandra Vioni, June Sig Sung, Aimilios Chalamandaris, Pirros Tsiakoulis, Paris Mastorocostas

    Abstract: In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autore…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: Final published version available at: Speech Communication. arXiv admin note: substantial text overlap with arXiv:2111.10168

  3. arXiv:2211.01327  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis

    Authors: Konstantinos Klapsas, Karolos Nikitaras, Nikolaos Ellinas, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: A large part of the expressive speech synthesis literature focuses on learning prosodic representations of the speech signal which are then modeled by a prior distribution during inference. In this paper, we compare different prior architectures at the task of predicting phoneme level prosodic representations extracted with an unsupervised FVAE model. We use both subjective and objective metrics t…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  4. arXiv:2211.00523  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis

    Authors: Karolos Nikitaras, Konstantinos Klapsas, Nikolaos Ellinas, Georgia Maniati, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules that operate in the correspond…

    Submitted 1 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  5. arXiv:2211.00342  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features

    Authors: Alexandra Vioni, Georgia Maniati, Nikolaos Ellinas, June Sig Sung, Inchul Hwang, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models. Such MOS prediction models include MOSNet and LDNet that use spectral features as input, and SSL-MOS that relies on a pretrained self-supervised learning model that directly uses the speech signal as input. In modern high-quality neural TTS systems, prosodic appropriateness with re…

    Submitted 7 May, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: Proceedings of ICASSP 2023

  6. arXiv:2210.17264   

    cs.SD cs.CL cs.LG eess.AS

    Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

    Authors: Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Georgia Maniati, Panos Kakoulidis, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper presents a method for end-to-end cross-lingual text-to-speech (TTS) which aims to preserve the target language's pronunciation regardless of the original speaker's language. The model used is based on a non-attentive Tacotron architecture, where the decoder has been replaced with a normalizing flow network conditioned on the speaker identity, allowing both TTS and voice conversion (VC)…

    Submitted 27 February, 2024; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: Fundamental changes to the model described and experimental procedure

  7. arXiv:2204.05070  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Fine-grained Noise Control for Multispeaker Speech Synthesis

    Authors: Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations. Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors, i.e. linguistic content, prosody and timbre from any residual factors, such as recording conditions and background noise. This paper pr…

    Submitted 27 October, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  8. Karaoker: Alignment-free singing voice synthesis with speech training data

    Authors: Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris

    Abstract: Existing singing voice synthesis models (SVS) are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthes…

    Submitted 31 August, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  9. arXiv:2204.03421  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Self-supervised learning for robust voice cloning

    Authors: Konstantinos Klapsas, Nikolaos Ellinas, Karolos Nikitaras, Georgios Vamvoukakis, Panos Kakoulidis, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high quality speech representations when specific audio augmentations are…

    Submitted 2 November, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  10. arXiv:2204.03040  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

    Authors: Georgia Maniati, Alexandra Vioni, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: In this work, we present the SOMOS dataset, the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances of the LJ Speech voice, a publ…

    Submitted 24 August, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  11. arXiv:2203.14416  [pdf, other]

    eess.AS cs.LG cs.SD

    Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge

    Authors: Sangjun Park, Kihyun Choo, Joohyung Lee, Anton V. Porov, Konstantin Osipov, June Sig Sung

    Abstract: Text-to-Speech (TTS) services that run on edge devices have many advantages compared to cloud TTS, e.g., latency and privacy issues. However, neural vocoders with a low complexity and small model footprint inevitably generate annoying sounds. This study proposes a Bunched LPCNet2, an improved LPCNet architecture that provides highly efficient performance in high-quality for cloud servers and in a…

    Submitted 30 June, 2022; v1 submitted 27 March, 2022; originally announced March 2022.

    Comments: Interspeech 2022

  12. arXiv:2111.10177  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis

    Authors: Alexandra Vioni, Myrsini Christidou, Nikolaos Ellinas, Georgios Vamvoukakis, Panos Kakoulidis, Taehoon Kim, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering…

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Proceedings of ICASSP 2021

  13. arXiv:2111.10173  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Word-Level Style Control for Expressive, Non-attentive Speech Synthesis

    Authors: Konstantinos Klapsas, Nikolaos Ellinas, June Sig Sung, Hyoungmin Park, Spyros Raptis

    Abstract: This paper presents an expressive speech synthesis architecture for modeling and controlling the speaking style at a word level. It attempts to learn word-level stylistic and prosodic representations of the speech data, with the aid of two encoders. The first one models style by finding a combination of style tokens for each word given the acoustic features, and the second outputs a word-level seq…

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Proceedings of SPECOM 2021

  14. arXiv:2111.10168  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control

    Authors: Myrsini Christidou, Alexandra Vioni, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Panos Kakoulidis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup, which is based on prosodic clustering. An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder. Several improvements over the basic single-speaker method are proposed that increase the prosodic control ra…

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Proceedings of SPECOM 2021

  15. arXiv:2111.09146  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

    Authors: Konstantinos Markopoulos, Nikolaos Ellinas, Alexandra Vioni, Myrsini Christidou, Panos Kakoulidis, Georgios Vamvoukakis, Georgia Maniati, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis, Aimilios Chalamandaris

    Abstract: In this paper, a text-to-rapping/singing system is introduced, which can be adapted to any speaker's voice. It utilizes a Tacotron-based multispeaker acoustic model trained on read-only speech data and which provides prosody control at the phoneme level. Dataset augmentation and additional prosody manipulation based on traditional DSP algorithms are also investigated. The neural TTS model is fine-…

    Submitted 17 November, 2021; originally announced November 2021.

    Comments: Proceedings of 11th ISCA Speech Synthesis Workshop (SSW 11)

  16. arXiv:2111.09075  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    Cross-lingual Low Resource Speaker Adaptation Using Phonological Features

    Authors: Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

    Abstract: The idea of using phonological features instead of phonemes as input to sequence-to-sequence TTS has been recently proposed for zero-shot multilingual speech synthesis. This approach is useful for code-switching, as it facilitates the seamless uttering of foreign text embedded in a stream of native text. In our work, we train a language-agnostic multispeaker model conditioned on a set of phonologi…

    Submitted 17 November, 2021; originally announced November 2021.

    Comments: Proceedings of INTERSPEECH 2021

  17. arXiv:2111.09052  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

    Authors: Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis, Spyros Raptis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis

    Abstract: This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture that adopts modules from both the Tacotron 1 and 2 models is proposed, while stability is ensured by usin…

    Submitted 17 November, 2021; originally announced November 2021.

    Comments: Proceedings of INTERSPEECH 2020

  18. arXiv:2103.14776  [pdf, other]

    eess.AS cs.LG cs.SD

    Scalable and Efficient Neural Speech Coding: A Hybrid Design

    Authors: Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, Minje Kim

    Abstract: We present a scalable and efficient neural waveform coding system for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as a neural waveform codec (NWC) during its feedforward routine. The proposed NWC also defines quantization and entropy coding as a trainable module, so the coding artifact…

    Submitted 27 November, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM TASLP), 2021 (Accepted for publication)

  19. arXiv:2101.00054  [pdf, other]

    cs.SD cs.LG eess.AS

    Psychoacoustic Calibration of Loss Functions for Efficient End-to-End Neural Audio Coding

    Authors: Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack, Minje Kim

    Abstract: Conventional audio coding technologies commonly leverage human perception of sound, or psychoacoustics, to reduce the bitrate while preserving the perceptual quality of the decoded audio signals. For neural audio codecs, however, the objective nature of the loss function usually leads to suboptimal sound quality as well as high run-time complexity due to the large model size. In this work, we pres…

    Submitted 31 December, 2020; originally announced January 2021.

    Journal ref: IEEE Signal Processing Letters, vol. 27, pp. 2159-2163, 2020

  20. arXiv:2005.00919  [pdf, other]

    eess.SP cs.IT

    Compressed-Sensing based Beam Detection in 5G NR Initial Access

    Authors: Junmo Sung, Brian L. Evans

    Abstract: To support millimeter wave (mmWave) frequency bands in cellular communications, both the base station and the mobile platform utilize large antenna arrays to steer narrow beams towards each other to compensate for the path loss and improve communication performance. The time-frequency resource allocated for initial access, however, is limited, which gives rise to the need for efficient approaches for beam…

    Submitted 2 May, 2020; originally announced May 2020.

    Comments: 5 pages, 6 figures, SPAWC2020

  21. arXiv:2002.05604  [pdf, other]

    eess.AS cs.MM cs.SD eess.SP

    Efficient And Scalable Neural Residual Waveform Coding With Collaborative Quantization

    Authors: Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack, Minje Kim

    Abstract: Scalability and efficiency are desired in neural speech codecs, which support a wide range of bitrates for applications on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and the corresponding residuals. CQ does not simply shoehorn LPC to a neural network, but bridges the computational capacity of advanced neural network model…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: Accepted in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 4-8, 2020

  22. arXiv:1911.05727  [pdf]

    cs.CY cs.IR eess.IV

    Artificial Intelligence Strategies for National Security and Safety Standards

    Authors: Erik Blasch, James Sung, Tao Nguyen, Chandra P. Daniel, Alisa P. Mason

    Abstract: Recent advances in artificial intelligence (AI) have led to an explosion of multimedia applications (e.g., computer vision (CV) and natural language processing (NLP)) for different domains such as commercial, industrial, and intelligence. In particular, the use of AI applications in a national security environment is often problematic because the opaque nature of the systems leads to an inability…

    Submitted 3 November, 2019; originally announced November 2019.

    Comments: Presented at AAAI FSS-19: Artificial Intelligence in Government and Public Sector, Arlington, Virginia, USA

  23. arXiv:1907.00482  [pdf, other]

    eess.SP cs.IT

    Base Station Antenna Selection for Low-Resolution ADC Systems

    Authors: Jinseok Choi, Junmo Sung, Narayan Prasad, Xiao-Feng Qi, Brian L. Evans, Alan Gatherer

    Abstract: This paper investigates antenna selection at a base station with large antenna arrays and low-resolution analog-to-digital converters. For downlink transmit antenna selection for narrowband channels, we show (1) a selection criterion that maximizes sum rate with zero-forcing precoding equivalent to that of a perfect quantization system; (2) maximum sum rate increases with number of selected antenn…

    Submitted 30 June, 2019; originally announced July 2019.

    Comments: Submitted to IEEE Transactions on Communications

  24. arXiv:1906.07769  [pdf, other]

    eess.AS cs.LG cs.SD

    Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding

    Authors: Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, Minje Kim

    Abstract: Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We propose a cross-module residual learning (CMRL) pipeline as a module carrier with each module reconstructing the residual from its preceding modules. C…

    Submitted 13 September, 2019; v1 submitted 18 June, 2019; originally announced June 2019.

    Comments: Accepted for publication in INTERSPEECH 2019

    Journal ref: Published in Interspeech 2019

  25. arXiv:1808.04530  [pdf, other]

    eess.SP

    Hybrid Powerline/Wireless Diversity for Smart Grid Communications: Design Challenges and Real-time Implementation

    Authors: Junmo Sung, Mostafa Sayed, Mahmoud Elgenedy, Brian L. Evans, Naofal Al-Dhahir, Il Han Kim, Khurram Waheed

    Abstract: The demand for energy is growing at an unprecedented pace that is much higher than the energy generation capacity growth rate using both conventional and green technologies. In particular, the electric power sector is consistently rated among the most dynamic growth markets over all other energy markets. Distributed (decentralized) energy generation based on renewable energy sources is an efficient…

    Submitted 14 August, 2018; originally announced August 2018.

    Comments: IEEE Communications Magazine, submitted July 5, 2018

  26. arXiv:1801.09774  [pdf, other]

    cs.SD eess.AS

    On Psychoacoustically Weighted Cost Functions Towards Resource-Efficient Deep Neural Networks for Speech Denoising

    Authors: Kai Zhen, Aswin Sivaraman, Jongmo Sung, Minje Kim

    Abstract: We present a psychoacoustically enhanced cost function to balance network complexity and perceptual performance of deep neural networks for speech denoising. While training the network, we utilize perceptual weights added to the ordinary mean-squared error to emphasize contribution from frequency bins which are most audible while ignoring error from inaudible bins. To generate the weights, we empl…

    Submitted 29 January, 2018; originally announced January 2018.

    Comments: 5 pages, 4 figures