-
Controllable Prosody Generation With Partial Inputs
Authors:
Dan Andrei Iliescu,
Devang Savita Ram Mohan,
Tian Huey Teh,
Zack Hodari
Abstract:
We address the problem of human-in-the-loop control for generating prosody in the context of text-to-speech synthesis. Controlling prosody is challenging because existing generative models lack an efficient interface through which users can modify the output quickly and precisely. To solve this, we introduce a novel framework whereby the user provides partial inputs and the generative model generates the missing features. We propose a model that is specifically designed to encode partial prosodic features and output complete audio. We show empirically that our model displays two essential qualities of a human-in-the-loop control mechanism: efficiency and robustness. With even a very small number of input values (~4), our model enables users to improve the quality of the output significantly in terms of listener preference (4:1).
Submitted 15 April, 2024; v1 submitted 14 March, 2023;
originally announced March 2023.
-
Disentangling Domain and Content
Authors:
Dan Andrei Iliescu,
Aliaksei Mikhailiuk,
Damon Wischik,
Rafal Mantiuk
Abstract:
Many real-world datasets can be divided into groups according to certain salient features (e.g. grouping images by subject, grouping text by font, etc.). Often, machine learning tasks require that these features be represented separately from those manifesting independently of the grouping. For example, image translation entails changing the style of an image while preserving its content. We formalize these two kinds of attributes as two complementary generative factors called "domain" and "content", and address the problem of disentangling them in a fully unsupervised way. To achieve this, we propose a principled, generalizable probabilistic model inspired by the Variational Autoencoder. Our model exhibits state-of-the-art performance on the composite task of generating images by combining the domain of one input with the content of another. Distinctively, it can perform this task in a few-shot, unsupervised manner, without being provided with explicit labelling for either domain or content. The disentangled representations are learned through the combination of a group-wise encoder and a novel domain-confusion loss.
Submitted 15 February, 2022;
originally announced February 2022.
-
Training a Task-Specific Image Reconstruction Loss
Authors:
Aamir Mustafa,
Aliaksei Mikhailiuk,
Dan Andrei Iliescu,
Varun Babbar,
Rafal K. Mantiuk
Abstract:
The choice of a loss function is an important factor when training neural networks for image restoration problems, such as single image super resolution. The loss function should encourage natural and perceptually pleasing results. A popular choice for a loss is a pre-trained network, such as VGG, which is used as a feature extractor for computing the difference between restored and reference images. However, such an approach has multiple drawbacks: it is computationally expensive, requires regularization and hyper-parameter tuning, and involves a large network trained on an unrelated task. Furthermore, it has been observed that there is no single loss function that works best across all applications and across different datasets. In this work, we instead propose to train a set of loss functions that are application specific in nature. Our loss function comprises a series of discriminators that are trained to detect and penalize the presence of application-specific artifacts. We show that a single natural image and corresponding distortions are sufficient to train our feature extractor that outperforms state-of-the-art loss functions in applications like single image super resolution, denoising, and JPEG artifact removal. Finally, we conclude that an effective loss function does not have to be a good predictor of perceived image quality, but instead needs to be specialized in identifying the distortions for a given restoration method.
Submitted 17 October, 2021; v1 submitted 26 March, 2021;
originally announced March 2021.