Search | arXiv e-print repository

EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Authors: Ge Zhu, Yutong Wen, Marc-André Carbonneau, Zhiyao Duan

Abstract: Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate on the latent domain with cascaded phase recovery modules to reconstruct waveform. This poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in spectrogram domain under the framework of elucidated diffusion models (EDM). Combining wit… ▽ More Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate on the latent domain with cascaded phase recovery modules to reconstruct waveform. This poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in spectrogram domain under the framework of elucidated diffusion models (EDM). Combining with efficient deterministic sampler, we achieved similar Fréchet audio distance (FAD) score as top-ranked baseline with only 10 steps and reached state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also revealed a potential concern regarding diffusion based audio generation models that they tend to generate samples with high perceptual similarity to the data from training data. Project page: https://agentcooper2002.github.io/EDMSound/ △ Less

Submitted 18 November, 2023; v1 submitted 14 November, 2023; originally announced November 2023.

Comments: Accepted at NeurIPS Workshop: Machine Learning for Audio (Camera Ready)

arXiv:2307.06040 [pdf, other]

Rhythm Modeling for Voice Conversion

Authors: Benjamin van Niekerk, Marc-André Carbonneau, Herman Kamper

Abstract: Voice conversion aims to transform source speech into a different target voice. However, typical voice conversion systems do not account for rhythm, which is an important factor in the perception of speaker identity. To bridge this gap, we introduce Urhythmic-an unsupervised method for rhythm conversion that does not require parallel data or text transcriptions. Using self-supervised representatio… ▽ More Voice conversion aims to transform source speech into a different target voice. However, typical voice conversion systems do not account for rhythm, which is an important factor in the perception of speaker identity. To bridge this gap, we introduce Urhythmic-an unsupervised method for rhythm conversion that does not require parallel data or text transcriptions. Using self-supervised representations, we first divide source audio into segments approximating sonorants, obstruents, and silences. Then we model rhythm by estimating speaking rate or the duration distribution of each segment type. Finally, we match the target speaking rate or rhythm by time-stretching the speech segments. Experiments show that Urhythmic outperforms existing unsupervised methods in terms of quality and prosody. Code and checkpoints: https://github.com/bshall/urhythmic. Audio demo page: https://ubisoft-laforge.github.io/speech/urhythmic. △ Less

Submitted 12 July, 2023; originally announced July 2023.

Comments: 5 pages, 4 figures, 4 tables, submitted to IEEE Signal Processing Letters

arXiv:2111.02392 [pdf, other]

doi 10.1109/ICASSP43922.2022.9746484

A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion

Authors: Benjamin van Niekerk, Marc-André Carbonneau, Julian Zaïdi, Mathew Baas, Hugo Seuté, Herman Kamper

Abstract: The goal of voice conversion is to transform source speech into a target voice, kee** the content unchanged. In this paper, we focus on self-supervised representation learning for voice conversion. Specifically, we compare discrete and soft speech units as input features. We find that discrete representations effectively remove speaker information but discard some linguistic content - leading to… ▽ More The goal of voice conversion is to transform source speech into a target voice, kee** the content unchanged. In this paper, we focus on self-supervised representation learning for voice conversion. Specifically, we compare discrete and soft speech units as input features. We find that discrete representations effectively remove speaker information but discard some linguistic content - leading to mispronunciations. As a solution, we propose soft speech units. To learn soft units, we predict a distribution over discrete speech units. By modeling uncertainty, soft units capture more content information, improving the intelligibility and naturalness of converted speech. Samples available at https://ubisoft-laforge.github.io/speech/soft-vc/. Code available at https://github.com/bshall/soft-vc/. △ Less

Submitted 8 June, 2022; v1 submitted 3 November, 2021; originally announced November 2021.

Comments: 5 pages, 2 figures, 2 tables. Accepted at ICASSP 2022

arXiv:2108.02271 [pdf, ps, other]

doi 10.21437/Interspeech.2022-10761

Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis

Authors: Julian Zaïdi, Hugo Seuté, Benjamin van Niekerk, Marc-André Carbonneau

Abstract: This paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state-of-the-art for cross-speaker prosody transfer on any text. This is one of the most challenging, and rarely directly addressed, task in speech synthesis, especially for highly expressive data. Daft-Exprt uses FiLM conditioning layers to strategically inject different prosodic information in all parts of the architect… ▽ More This paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state-of-the-art for cross-speaker prosody transfer on any text. This is one of the most challenging, and rarely directly addressed, task in speech synthesis, especially for highly expressive data. Daft-Exprt uses FiLM conditioning layers to strategically inject different prosodic information in all parts of the architecture. The model explicitly encodes traditional low-level prosody features such as pitch, loudness and duration, but also higher level prosodic information that helps generating convincing voices in highly expressive styles. Speaker identity and prosodic information are disentangled through an adversarial training strategy that enables accurate prosody transfer across speakers. Experimental results show that Daft-Exprt significantly outperforms strong baselines on inter-text cross-speaker prosody transfer tasks, while yielding naturalness comparable to state-of-the-art expressive models. Moreover, results indicate that the model discards speaker identity information from the prosody representation, and consistently generate speech with the desired voice. We publicly release our code and provide speech samples from our experiments. △ Less

Submitted 5 April, 2022; v1 submitted 4 August, 2021; originally announced August 2021.

Comments: Submitted to Interspeech 2022, 5 pages, 5 figures, 2 tables

Journal ref: Proc. Interspeech (2022) 4591-4595

arXiv:2103.12177 [pdf, other]

Energy Disaggregation using Variational Autoencoders

Authors: Antoine Langevin, Marc-André Carbonneau, Mohamed Cheriet, Ghyslain Gagnon

Abstract: Non-intrusive load monitoring (NILM) is a technique that uses a single sensor to measure the total power consumption of a building. Using an energy disaggregation method, the consumption of individual appliances can be estimated from the aggregate measurement. Recent disaggregation algorithms have significantly improved the performance of NILM systems. However, the generalization capability of the… ▽ More Non-intrusive load monitoring (NILM) is a technique that uses a single sensor to measure the total power consumption of a building. Using an energy disaggregation method, the consumption of individual appliances can be estimated from the aggregate measurement. Recent disaggregation algorithms have significantly improved the performance of NILM systems. However, the generalization capability of these methods to different houses as well as the disaggregation of multi-state appliances are still major challenges. In this paper we address these issues and propose an energy disaggregation approach based on the variational autoencoders framework. The probabilistic encoder makes this approach an efficient model for encoding information relevant to the reconstruction of the target appliance consumption. In particular, the proposed model accurately generates more complex load profiles, thus improving the power signal reconstruction of multi-state appliances. Moreover, its regularized latent space improves the generalization capabilities of the model across different houses. The proposed model is compared to state-of-the-art NILM approaches on the UK-DALE and REFIT datasets, and yields competitive results. The mean absolute error reduces by 18% on average across all appliances compared to the state-of-the-art. The F1-Score increases by more than 11%, showing improvements for the detection of the target appliance in the aggregate measurement. △ Less

Submitted 19 July, 2021; v1 submitted 22 March, 2021; originally announced March 2021.

Comments: 13 pages, 2 figures, results for the REFIT dataset added

Showing 1–5 of 5 results for author: Carbonneau, M