-
MoME: Mixture of Multimodal Experts for Cancer Survival Prediction
Authors:
Conghao Xiong,
Hao Chen,
Hao Zheng,
Dong Wei,
Yefeng Zheng,
Joseph J. Y. Sung,
Irwin King
Abstract:
Survival analysis, as a challenging task, requires integrating Whole Slide Images (WSIs) and genomic data for comprehensive decision-making. There are two main challenges in this task: significant heterogeneity and complex inter- and intra-modal interactions between the two modalities. Previous approaches utilize co-attention methods, which fuse features from both modalities only once after separate encoding. However, these approaches are insufficient for modeling the complex task due to the heterogeneous nature between the modalities. To address these issues, we propose a Biased Progressive Encoding (BPE) paradigm, performing encoding and fusion simultaneously. This paradigm uses one modality as a reference when encoding the other. It enables deep fusion of the modalities through multiple alternating iterations, progressively reducing the cross-modal disparities and facilitating complementary interactions. Besides modality heterogeneity, survival analysis involves various biomarkers from WSIs, genomics, and their combinations. The critical biomarkers may exist in different modalities under individual variations, necessitating flexible adaptation of the models to specific scenarios. Therefore, we further propose a Mixture of Multimodal Experts (MoME) layer to dynamically select tailored experts in each stage of the BPE paradigm. Experts incorporate reference information from another modality to varying degrees, enabling a balanced or biased focus on different modalities during the encoding process. Extensive experimental results demonstrate the superior performance of our method on various datasets, including TCGA-BLCA, TCGA-UCEC and TCGA-LUAD. Code is available at https://github.com/BearCleverProud/MoME.
Submitted 13 June, 2024;
originally announced June 2024.
-
Controllable speech synthesis by learning discrete phoneme-level prosodic representations
Authors:
Nikolaos Ellinas,
Myrsini Christidou,
Alexandra Vioni,
June Sig Sung,
Aimilios Chalamandaris,
Pirros Tsiakoulis,
Paris Mastorocostas
Abstract:
In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods in order to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.
Submitted 29 November, 2022;
originally announced November 2022.
-
Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis
Authors:
Konstantinos Klapsas,
Karolos Nikitaras,
Nikolaos Ellinas,
June Sig Sung,
Inchul Hwang,
Spyros Raptis,
Aimilios Chalamandaris,
Pirros Tsiakoulis
Abstract:
A large part of the expressive speech synthesis literature focuses on learning prosodic representations of the speech signal which are then modeled by a prior distribution during inference. In this paper, we compare different prior architectures at the task of predicting phoneme level prosodic representations extracted with an unsupervised FVAE model. We use both subjective and objective metrics to show that normalizing flow based prior networks can result in more expressive speech at the cost of a slight drop in quality. Furthermore, we show that the synthesized speech has higher variability, for a given text, due to the nature of normalizing flows. We also propose a Dynamical VAE model, that can generate higher quality speech although with decreased expressiveness and variability compared to the flow based models.
Submitted 2 November, 2022;
originally announced November 2022.
-
Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis
Authors:
Karolos Nikitaras,
Konstantinos Klapsas,
Nikolaos Ellinas,
Georgia Maniati,
June Sig Sung,
Inchul Hwang,
Spyros Raptis,
Aimilios Chalamandaris,
Pirros Tsiakoulis
Abstract:
This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules that operate at the corresponding level. We show that the fine-grained latent space also captures coarse-grained information, which is more evident as the dimension of the latent space increases in order to capture diverse prosodic representations. Therefore, a trade-off arises between the diversity of the token-level and utterance-level representations and their disentanglement. We alleviate this issue by first capturing rich speech attributes in a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations in order to predict the phoneme-level posterior latents extracted during the previous step. Both qualitative and quantitative evaluations are used to demonstrate the effectiveness of the proposed approach. Audio samples are available on our demo page.
Submitted 1 November, 2022;
originally announced November 2022.
-
Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features
Authors:
Alexandra Vioni,
Georgia Maniati,
Nikolaos Ellinas,
June Sig Sung,
Inchul Hwang,
Aimilios Chalamandaris,
Pirros Tsiakoulis
Abstract:
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models. Such MOS prediction models include MOSNet and LDNet that use spectral features as input, and SSL-MOS that relies on a pretrained self-supervised learning model that directly uses the speech signal as input. In modern high-quality neural TTS systems, prosodic appropriateness with regard to the spoken content is a decisive factor for speech naturalness. For this reason, we propose to include prosodic and linguistic features as additional inputs in MOS prediction systems, and evaluate their impact on the prediction outcome. We consider phoneme level F0 and duration features as prosodic inputs, as well as Tacotron encoder outputs, POS tags and BERT embeddings as higher-level linguistic inputs. All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations. Results show that the proposed additional features are beneficial in the MOS prediction task, by improving the predicted MOS scores' correlation with the ground truths, both at utterance-level and system-level predictions.
Submitted 7 May, 2023; v1 submitted 1 November, 2022;
originally announced November 2022.
-
Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation
Authors:
Nikolaos Ellinas,
Georgios Vamvoukakis,
Konstantinos Markopoulos,
Georgia Maniati,
Panos Kakoulidis,
June Sig Sung,
Inchul Hwang,
Spyros Raptis,
Aimilios Chalamandaris,
Pirros Tsiakoulis
Abstract:
This paper presents a method for end-to-end cross-lingual text-to-speech (TTS) which aims to preserve the target language's pronunciation regardless of the original speaker's language. The model used is based on a non-attentive Tacotron architecture, where the decoder has been replaced with a normalizing flow network conditioned on the speaker identity, allowing both TTS and voice conversion (VC) to be performed by the same model due to the inherent linguistic content and speaker identity disentanglement. When used in a cross-lingual setting, acoustic features are initially produced with a native speaker of the target language and then voice conversion is applied by the same model in order to convert these features to the target speaker's voice. We verify through objective and subjective evaluations that our method can have benefits compared to baseline cross-lingual synthesis. By including speakers averaging 7.5 minutes of speech, we also present positive results on low-resource scenarios.
Submitted 27 February, 2024; v1 submitted 31 October, 2022;
originally announced October 2022.
-
Fine-grained Noise Control for Multispeaker Speech Synthesis
Authors:
Karolos Nikitaras,
Georgios Vamvoukakis,
Nikolaos Ellinas,
Konstantinos Klapsas,
Konstantinos Markopoulos,
Spyros Raptis,
June Sig Sung,
Gunu Jho,
Aimilios Chalamandaris,
Pirros Tsiakoulis
Abstract:
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations. Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors, i.e. linguistic content, prosody and timbre, from any residual factors, such as recording conditions and background noise. This paper proposes unsupervised, interpretable and fine-grained noise and prosody modeling. We incorporate adversarial training, representation bottleneck and utterance-to-frame modeling in order to learn frame-level noise representations. To the same end, we perform fine-grained prosody modeling via a Fully Hierarchical Variational AutoEncoder (FVAE) which additionally results in more expressive speech synthesis.
Submitted 27 October, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
Karaoker: Alignment-free singing voice synthesis with speech training data
Authors:
Panos Kakoulidis,
Nikolaos Ellinas,
Georgios Vamvoukakis,
Konstantinos Markopoulos,
June Sig Sung,
Gunu Jho,
Pirros Tsiakoulis,
Aimilios Chalamandaris
Abstract:
Existing singing voice synthesis models (SVS) are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthesizes singing voice and transfers style following a multi-dimensional template extracted from a source waveform of an unseen singer/speaker. The model is jointly conditioned with a single deep convolutional encoder on continuous data including pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves. We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result. In addition to multitasking, we also employ a Wasserstein GAN training scheme as well as new losses on the acoustic model's output to further refine the quality of the model.
Submitted 31 August, 2022; v1 submitted 8 April, 2022;
originally announced April 2022.
-
Self-supervised learning for robust voice cloning
Authors:
Konstantinos Klapsas,
Nikolaos Ellinas,
Karolos Nikitaras,
Georgios Vamvoukakis,
Panos Kakoulidis,
Konstantinos Markopoulos,
Spyros Raptis,
June Sig Sung,
Gunu Jho,
Aimilios Chalamandaris,
Pirros Tsiakoulis
Abstract:
Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high quality speech representations when specific audio augmentations are applied to the vanilla algorithm. We further extend the augmentations in the training procedure to aid the resulting features to capture the speaker identity and to make them robust to noise and acoustic conditions. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture, aiming to achieve multispeaker speech synthesis without utilizing additional speaker features. This method enables us to train our model in an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice. Subjective and objective evaluations are used to validate the proposed model, as well as the robustness to the acoustic conditions of the target utterance.
Submitted 2 November, 2022; v1 submitted 7 April, 2022;
originally announced April 2022.
-
SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis
Authors:
Georgia Maniati,
Alexandra Vioni,
Nikolaos Ellinas,
Karolos Nikitaras,
Konstantinos Klapsas,
June Sig Sung,
Gunu Jho,
Aimilios Chalamandaris,
Pirros Tsiakoulis
Abstract:
In this work, we present the SOMOS dataset, the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset which is a common benchmark for building neural acoustic models and vocoders. Utterances are generated from 200 TTS systems including vanilla neural acoustic models as well as models which allow prosodic variations. An LPCNet vocoder is used for all systems, so that the samples' variation depends only on the acoustic models. The synthesized utterances provide balanced and adequate domain and length coverage. We collect MOS naturalness evaluations on 3 English Amazon Mechanical Turk locales and share practices leading to reliable crowdsourced annotations for this task. We provide baseline results of state-of-the-art MOS prediction models on the SOMOS dataset and show the limitations that such models face when assigned to evaluate TTS utterances.
Submitted 24 August, 2022; v1 submitted 6 April, 2022;
originally announced April 2022.
-
Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge
Authors:
Sangjun Park,
Kihyun Choo,
Joohyung Lee,
Anton V. Porov,
Konstantin Osipov,
June Sig Sung
Abstract:
Text-to-Speech (TTS) services that run on edge devices have many advantages compared to cloud TTS, e.g., in latency and privacy. However, neural vocoders with a low complexity and small model footprint inevitably generate annoying sounds. This study proposes Bunched LPCNet2, an improved LPCNet architecture that provides highly efficient performance at high quality for cloud servers and at low complexity for low-resource edge devices. Single logistic distribution achieves computational efficiency, and insightful tricks reduce the model footprint while maintaining speech quality. A DualRate architecture, which generates a lower sampling rate from a prosody model, is also proposed to reduce maintenance costs. The experiments demonstrate that Bunched LPCNet2 generates satisfactory speech quality with a model footprint of 1.1MB while operating faster than real-time on an RPi 3B. Our audio samples are available at https://srtts.github.io/bunchedLPCNet2.
Submitted 30 June, 2022; v1 submitted 27 March, 2022;
originally announced March 2022.
-
Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis
Authors:
Alexandra Vioni,
Myrsini Christidou,
Nikolaos Ellinas,
Georgios Vamvoukakis,
Panos Kakoulidis,
Taehoon Kim,
June Sig Sung,
Hyoungmin Park,
Aimilios Chalamandaris,
Pirros Tsiakoulis
Abstract:
This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.
Submitted 19 November, 2021;
originally announced November 2021.
-
Word-Level Style Control for Expressive, Non-attentive Speech Synthesis
Authors:
Konstantinos Klapsas,
Nikolaos Ellinas,
June Sig Sung,
Hyoungmin Park,
Spyros Raptis
Abstract:
This paper presents an expressive speech synthesis architecture for modeling and controlling the speaking style at a word level. It attempts to learn word-level stylistic and prosodic representations of the speech data, with the aid of two encoders. The first one models style by finding a combination of style tokens for each word given the acoustic features, and the second outputs a word-level sequence conditioned only on the phonetic information in order to disentangle it from the style information. The two encoder outputs are aligned and concatenated with the phoneme encoder outputs and then decoded with a Non-Attentive Tacotron model. An extra prior encoder is used to predict the style tokens autoregressively, in order for the model to be able to run without a reference utterance. We find that the resulting model gives both word-level and global control over the style, as well as prosody transfer capabilities.
Submitted 19 November, 2021;
originally announced November 2021.
-
Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control
Authors:
Myrsini Christidou,
Alexandra Vioni,
Nikolaos Ellinas,
Georgios Vamvoukakis,
Konstantinos Markopoulos,
Panos Kakoulidis,
June Sig Sung,
Hyoungmin Park,
Aimilios Chalamandaris,
Pirros Tsiakoulis
Abstract:
This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup, which is based on prosodic clustering. An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder. Several improvements over the basic single-speaker method are proposed that increase the prosodic control range and coverage. More specifically we employ data augmentation, F0 normalization, balanced clustering for duration, and speaker-independent prosodic clustering. These modifications enable fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. The model is also fine-tuned to unseen speakers with limited amounts of data and it is shown to maintain its prosody control capabilities, verifying that the speaker-independent prosodic clustering is effective. Experimental results verify that the model maintains high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.
Submitted 19 November, 2021;
originally announced November 2021.
-
Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control
Authors:
Konstantinos Markopoulos,
Nikolaos Ellinas,
Alexandra Vioni,
Myrsini Christidou,
Panos Kakoulidis,
Georgios Vamvoukakis,
Georgia Maniati,
June Sig Sung,
Hyoungmin Park,
Pirros Tsiakoulis,
Aimilios Chalamandaris
Abstract:
In this paper, a text-to-rapping/singing system is introduced, which can be adapted to any speaker's voice. It utilizes a Tacotron-based multispeaker acoustic model trained on read-only speech data and which provides prosody control at the phoneme level. Dataset augmentation and additional prosody manipulation based on traditional DSP algorithms are also investigated. The neural TTS model is fine-tuned to an unseen speaker's limited recordings, allowing rapping/singing synthesis with the target speaker's voice. The detailed pipeline of the system is described, which includes the extraction of the target pitch and duration values from an a cappella song and their conversion into the target speaker's valid range of notes before synthesis. An additional stage of prosodic manipulation of the output via WSOLA is also investigated for better matching the target duration values. The synthesized utterances can be mixed with an instrumental accompaniment track to produce a complete song. The proposed system is evaluated via subjective listening tests as well as in comparison to an available alternate system which also aims to produce synthetic singing voice from read-only training data. Results show that the proposed approach can produce high quality rapping/singing voice with increased naturalness.
Submitted 17 November, 2021;
originally announced November 2021.
-
Cross-lingual Low Resource Speaker Adaptation Using Phonological Features
Authors:
Georgia Maniati,
Nikolaos Ellinas,
Konstantinos Markopoulos,
Georgios Vamvoukakis,
June Sig Sung,
Hyoungmin Park,
Aimilios Chalamandaris,
Pirros Tsiakoulis
Abstract:
The idea of using phonological features instead of phonemes as input to sequence-to-sequence TTS has been recently proposed for zero-shot multilingual speech synthesis. This approach is useful for code-switching, as it facilitates the seamless uttering of foreign text embedded in a stream of native text. In our work, we train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages, with the goal of achieving cross-lingual speaker adaptation. We first experiment with the effect of language phonological similarity on cross-lingual TTS of several source-target language combinations. Subsequently, we fine-tune the model with very limited data of a new speaker's voice in either a seen or an unseen language, and achieve synthetic speech of equal quality, while preserving the target speaker's identity. With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature. In the extreme case of only 2 available adaptation utterances, we find that our model behaves as a few-shot learner, as the performance is similar in both the seen and unseen adaptation language scenarios.
Submitted 17 November, 2021;
originally announced November 2021.
-
High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency
Authors:
Nikolaos Ellinas,
Georgios Vamvoukakis,
Konstantinos Markopoulos,
Aimilios Chalamandaris,
Georgia Maniati,
Panos Kakoulidis,
Spyros Raptis,
June Sig Sung,
Hyoungmin Park,
Pirros Tsiakoulis
Abstract:
This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture that adopts modules from both the Tacotron 1 and 2 models is proposed, while stability is ensured by using a recently proposed purely location-based attention mechanism, suitable for arbitrary sentence length generation. During inference, the decoder is unrolled and acoustic feature generation is performed in a streaming manner, allowing for a nearly constant latency which is independent from the sentence length. Experimental results show that the acoustic model can produce feature sequences with minimal latency about 31 times faster than real-time on a computer CPU and 6.5 times on a mobile CPU, enabling it to meet the conditions required for real-time applications on both devices. The full end-to-end system can generate almost natural quality speech, which is verified by listening tests.
Submitted 17 November, 2021;
originally announced November 2021.
-
Scalable and Efficient Neural Speech Coding: A Hybrid Design
Authors:
Kai Zhen,
Jongmo Sung,
Mi Suk Lee,
Seungkwon Beack,
Minje Kim
Abstract:
We present a scalable and efficient neural waveform coding system for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as a neural waveform codec (NWC) during its feedforward routine. The proposed NWC also defines quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model components to NWC, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models have a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we employ the residual coding concept to concatenate multiple NWC autoencoding modules, where each NWC module performs residual coding to restore any reconstruction loss that its preceding modules have created. CMRL can scale down to cover lower bitrates as well, for which it employs a linear predictive coding (LPC) module as its first autoencoder. The hybrid design integrates LPC and NWC by redefining LPC's quantization as a differentiable process, so that the system can be trained in an end-to-end manner. The decoder of the proposed system has either one NWC (0.12 million parameters) in low to medium bitrate ranges (12 to 20 kbps) or two NWCs in the high bitrate range (32 kbps). Although the decoding complexity is not yet as low as that of conventional speech codecs, it is significantly reduced from that of other neural speech coders, such as a WaveNet-based vocoder. For wide-band speech coding quality, our system yields comparable or superior performance to AMR-WB and Opus on TIMIT test utterances at low and medium bitrates. The proposed system can scale up to higher bitrates to achieve near transparent performance.
Submitted 27 November, 2021; v1 submitted 26 March, 2021;
originally announced March 2021.
-
Psychoacoustic Calibration of Loss Functions for Efficient End-to-End Neural Audio Coding
Authors:
Kai Zhen,
Mi Suk Lee,
Jongmo Sung,
Seungkwon Beack,
Minje Kim
Abstract:
Conventional audio coding technologies commonly leverage human perception of sound, or psychoacoustics, to reduce the bitrate while preserving the perceptual quality of the decoded audio signals. For neural audio codecs, however, the objective nature of the loss function usually leads to suboptimal sound quality as well as high run-time complexity due to the large model size. In this work, we present a psychoacoustic calibration scheme to re-define the loss functions of neural audio coding systems so that they can decode signals more perceptually similar to the reference, yet with a much lower model complexity. The proposed loss function incorporates the global masking threshold, allowing the reconstruction error that corresponds to inaudible artifacts. Experimental results show that the proposed model outperforms a baseline neural codec that is twice as large and consumes 23.4% more bits per second. With the proposed method, a lightweight neural codec, with only 0.9 million parameters, performs near-transparent audio coding comparable with the commercial MPEG-1 Audio Layer III codec at 112 kbps.
Submitted 31 December, 2020;
originally announced January 2021.
-
Compressed-Sensing based Beam Detection in 5G NR Initial Access
Authors:
Junmo Sung,
Brian L. Evans
Abstract:
To support millimeter wave (mmWave) frequency bands in cellular communications, both the base station and the mobile platform utilize large antenna arrays to steer narrow beams towards each other to compensate for the path loss and improve communication performance. The time-frequency resource allocated for initial access, however, is limited, which gives rise to the need for efficient approaches for beam detection. For hybrid analog-digital beamforming (HB) architectures, which are used to reduce power consumption, we propose a compressed sensing (CS) based approach for 5G initial access beam detection that is compliant with the 3GPP standard. The CS-based approach is compared with exhaustive search in terms of beam detection accuracy and is shown by simulation to outperform it. Up to 256 antennas are considered, and the importance of a careful codebook design is reaffirmed.
Submitted 2 May, 2020;
originally announced May 2020.
-
Efficient And Scalable Neural Residual Waveform Coding With Collaborative Quantization
Authors:
Kai Zhen,
Mi Suk Lee,
Jongmo Sung,
Seungkwon Beack,
Minje Kim
Abstract:
Scalability and efficiency are desired in neural speech codecs, which support a wide range of bitrates for applications on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and the corresponding residuals. CQ does not simply shoehorn LPC into a neural network, but bridges the computational capacity of advanced neural network models and traditional, yet efficient and domain-specific digital signal processing methods in an integrated manner. We demonstrate that CQ achieves much higher quality than its predecessor at 9 kbps with even lower model complexity. We also show that CQ can scale up to 24 kbps where it outperforms AMR-WB and Opus. As a neural waveform codec, CQ models have fewer than 1 million parameters, significantly fewer than many other generative models.
Submitted 13 February, 2020;
originally announced February 2020.
-
Artificial Intelligence Strategies for National Security and Safety Standards
Authors:
Erik Blasch,
James Sung,
Tao Nguyen,
Chandra P. Daniel,
Alisa P. Mason
Abstract:
Recent advances in artificial intelligence (AI) have led to an explosion of multimedia applications (e.g., computer vision (CV) and natural language processing (NLP)) for different domains such as commercial, industrial, and intelligence. In particular, the use of AI applications in a national security environment is often problematic because the opaque nature of the systems leads to an inability for a human to understand how the results came about. A reliance on 'black boxes' to generate predictions and inform decisions is potentially disastrous. This paper explores how the application of standards during each stage of the development of an AI system deployed and used in a national security environment would help enable trust. Specifically, we focus on the standards outlined in Intelligence Community Directive 203 (Analytic Standards) to subject machine outputs to the same rigorous standards as analysis performed by humans.
Submitted 3 November, 2019;
originally announced November 2019.
-
Base Station Antenna Selection for Low-Resolution ADC Systems
Authors:
Jinseok Choi,
Junmo Sung,
Narayan Prasad,
Xiao-Feng Qi,
Brian L. Evans,
Alan Gatherer
Abstract:
This paper investigates antenna selection at a base station with large antenna arrays and low-resolution analog-to-digital converters. For downlink transmit antenna selection for narrowband channels, we show (1) a selection criterion that maximizes sum rate with zero-forcing precoding equivalent to that of a perfect quantization system; (2) maximum sum rate increases with number of selected antennas; (3) derivation of the sum rate loss function from using a subset of antennas; and (4) unlike high-resolution converter systems, sum rate loss reaches a maximum at a point of total transmit power and decreases beyond that point to converge to zero. For wideband orthogonal-frequency-division-multiplexing (OFDM) systems, our results hold when all subcarriers share a common subset of antennas. For uplink receive antenna selection for narrowband channels, we (1) generalize a greedy antenna selection criterion to capture tradeoffs between channel gain and quantization error; (2) propose a quantization-aware fast antenna selection algorithm using the criterion; and (3) derive a lower bound on sum rate achieved by the proposed algorithm based on submodular functions. For wideband OFDM systems, we extend our algorithm and derive a lower bound on its sum rate. Simulation results validate theoretical analyses and show increases in sum rate over conventional algorithms.
Submitted 30 June, 2019;
originally announced July 2019.
-
Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding
Authors:
Kai Zhen,
Jongmo Sung,
Mi Suk Lee,
Seungkwon Beack,
Minje Kim
Abstract:
Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We propose a cross-module residual learning (CMRL) pipeline as a module carrier with each module reconstructing the residual from its preceding modules. CMRL differs from other DNN-based speech codecs in that, rather than modeling the speech compression problem in a single large neural network, it optimizes a series of less-complicated modules in a two-phase training scheme. The proposed method shows better objective performance than AMR-WB and the state-of-the-art DNN-based speech codec with a similar network architecture. As an end-to-end model, it takes raw PCM signals as an input, but is also compatible with linear predictive coding (LPC), showing better subjective quality at high bitrates than AMR-WB and OPUS. The gain is achieved by using only 0.9 million trainable parameters, a significantly less complex architecture than the other DNN-based codecs in the literature.
Submitted 13 September, 2019; v1 submitted 18 June, 2019;
originally announced June 2019.
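
The residual-cascade idea in this abstract, where each module codes whatever error its predecessors left behind, can be illustrated with a toy sketch. This is not the authors' model: the uniform scalar quantizers below are hypothetical stand-ins for the neural autoencoding modules, and the function names and step sizes are assumptions chosen only for illustration.

```python
def quantize(values, step):
    """Toy 'module': uniform scalar quantizer (stand-in for a neural codec)."""
    return [round(v / step) * step for v in values]


def cascade_reconstruct(signal, steps):
    """Run a cascade in which each module codes the residual left so far.

    `steps` holds one quantizer step size per module (coarse to fine).
    Returns the sum of all module outputs, i.e. the final reconstruction.
    """
    reconstruction = [0.0] * len(signal)
    residual = list(signal)
    for step in steps:
        coded = quantize(residual, step)  # this module codes the current residual
        reconstruction = [r + c for r, c in zip(reconstruction, coded)]
        # The next module only sees what is still missing.
        residual = [s - r for s, r in zip(signal, reconstruction)]
    return reconstruction


def max_abs_error(signal, reconstruction):
    return max(abs(s - r) for s, r in zip(signal, reconstruction))


if __name__ == "__main__":
    x = [0.33, -0.71, 0.12, 0.98]
    for modules in ([0.5], [0.5, 0.1], [0.5, 0.1, 0.02]):
        err = max_abs_error(x, cascade_reconstruct(x, modules))
        print(len(modules), "module(s): max error", err)
```

Adding modules spends more bits but shrinks the remaining error, which is the scalability argument the abstract makes for covering a range of bitrates.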
-
Hybrid Powerline/Wireless Diversity for Smart Grid Communications: Design Challenges and Real-time Implementation
Authors:
Junmo Sung,
Mostafa Sayed,
Mahmoud Elgenedy,
Brian L. Evans,
Naofal Al-Dhahir,
Il Han Kim,
Khurram Waheed
Abstract:
The demand for energy is growing at an unprecedented pace that is much higher than the energy generation capacity growth rate using both conventional and green technologies. In particular, the electric power sector is consistently rated among the most dynamic growth markets over all other energy markets. Distributed (decentralized) energy generation based on renewable energy sources is an efficient and reliable solution to serve such huge energy demand growth [1]. However, to manage dynamic and complex distributed systems, a massive amount of data has to be measured, collected, exchanged and processed in real time. Smart grids manage an intelligent energy delivery network enabled by two-way communications between data concentrators operated by utility companies and smart meters installed at the end users. In particular, dynamic power-grid loading and peak load management are the two main driving forces for bidirectional communications over the grid. Narrowband power line communications (NB-PLC) and wireless communications in the unlicensed frequency band (sub-1 GHz or 2.4 GHz) are the two main communications systems adopted to support the growing smart grid applications. Moreover, since NB-PLC and unlicensed wireless links experience channel and interference with markedly different statistics, transmitting the same information signal concurrently over both links significantly enhances the smart grid communications reliability. In this article, we compare various diversity combining schemes for simultaneous power line and wireless transmissions. Furthermore, we developed a real-time testbed for the hybrid PLC/wireless system to demonstrate the performance enhancement achieved by PLC/wireless diversity combining over a single link performance.
Submitted 14 August, 2018;
originally announced August 2018.
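
The abstract does not name the diversity combining schemes it compares. As a hedged illustration only, the sketch below contrasts two textbook schemes, selection combining and maximal-ratio combining (MRC), for two links with independent noise. The function names and the example gains and noise variances are assumptions, not values from the article.

```python
import math


def db_to_linear(db):
    return 10.0 ** (db / 10.0)


def linear_to_db(linear):
    return 10.0 * math.log10(linear)


def selection_snr(branch_snrs):
    """Selection combining: use only the best link (linear SNRs)."""
    return max(branch_snrs)


def mrc_snr(branch_snrs):
    """MRC with independent noise: output SNR is the sum of branch SNRs."""
    return sum(branch_snrs)


def mrc_combine(received, gains, noise_vars):
    """Combine one symbol received over several links.

    Each branch is weighted by conj(h_i) / sigma_i^2, then the sum is
    normalised so that a noise-free input returns the transmitted symbol.
    """
    num = sum(h.conjugate() * r / n for r, h, n in zip(received, gains, noise_vars))
    den = sum(abs(h) ** 2 / n for h, n in zip(gains, noise_vars))
    return num / den


if __name__ == "__main__":
    # Hypothetical link qualities: PLC and wireless both at 10 dB SNR.
    snrs = [db_to_linear(10.0), db_to_linear(10.0)]
    print("selection:", linear_to_db(selection_snr(snrs)), "dB")
    print("MRC:      ", linear_to_db(mrc_snr(snrs)), "dB")
```

With two equally good links, MRC gains about 3 dB over picking a single link, while selection combining gains nothing in average SNR and only helps when one link fades.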
-
On Psychoacoustically Weighted Cost Functions Towards Resource-Efficient Deep Neural Networks for Speech Denoising
Authors:
Kai Zhen,
Aswin Sivaraman,
Jongmo Sung,
Minje Kim
Abstract:
We present a psychoacoustically enhanced cost function to balance network complexity and perceptual performance of deep neural networks for speech denoising. While training the network, we utilize perceptual weights added to the ordinary mean-squared error to emphasize contribution from frequency bins which are most audible while ignoring error from inaudible bins. To generate the weights, we employ psychoacoustic models to compute the global masking threshold from the clean speech spectra. We then evaluate the speech denoising performance of our perceptually guided neural network by using both objective and perceptual sound quality metrics, testing on various network structures ranging from shallow and narrow ones to deep and wide ones. The experimental results showcase our method as a valid approach for infusing perceptual significance to deep neural network operations. In particular, the more perceptually sensible enhancement in performance seen by simple neural network topologies proves that the proposed method can lead to resource-efficient speech denoising implementations in small devices without degrading the perceived signal fidelity.
Submitted 29 January, 2018;
originally announced January 2018.
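
A perceptually weighted squared error of the kind this abstract describes can be sketched as follows. The weighting rule here is an assumption made for illustration (signal-to-mask ratio in dB, floored at zero so that bins below the masking threshold are ignored); the paper's actual weights and its psychoacoustic model are not reproduced, and the masking threshold is simply passed in as given.

```python
import math


def perceptual_weights(clean_power, masking_threshold):
    """Hypothetical per-bin weights from the signal-to-mask ratio in dB.

    Bins whose clean power lies at or below the masking threshold get
    weight 0, so their errors are ignored as inaudible.
    """
    weights = []
    for power, mask in zip(clean_power, masking_threshold):
        if power <= 0.0 or mask <= 0.0:
            weights.append(0.0)
            continue
        smr_db = 10.0 * math.log10(power / mask)
        weights.append(max(0.0, smr_db))
    return weights


def weighted_mse(clean, estimate, weights):
    """Mean over bins of weight * squared error."""
    total = sum(w * (c - e) ** 2 for c, e, w in zip(clean, estimate, weights))
    return total / len(clean)


if __name__ == "__main__":
    clean_power = [100.0, 1.0]  # bin 0 is well above its mask, bin 1 below
    mask = [1.0, 10.0]          # assumed output of a psychoacoustic model
    w = perceptual_weights(clean_power, mask)
    print("weights:", w)
    print("loss:", weighted_mse([1.0, 1.0], [0.0, 0.0], w))
```

Because masked bins contribute nothing to the loss, a small network can spend its limited capacity on the audible bins, which is the resource-efficiency argument made in the abstract.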