-
MusicRL: Aligning Music Generation to Human Preferences
Authors:
Geoffrey Cideron,
Sertan Girgin,
Mauro Verzetti,
Damien Vincent,
Matej Kastelic,
Zalán Borsos,
Brian McWilliams,
Victor Ungureanu,
Olivier Bachem,
Olivier Pietquin,
Matthieu Geist,
Léonard Hussenot,
Neil Zeghidour,
Andrea Agostinelli
Abstract:
We propose MusicRL, the first music generation system finetuned from human feedback. Appreciation of text-to-music models is particularly subjective since the concept of musicality as well as the specific intention behind a caption are user-dependent (e.g. a caption such as "upbeat work-out music" can map to a retro guitar solo or a techno pop beat). Not only does this make supervised training of such models challenging, but it also calls for integrating continuous human feedback in their post-deployment finetuning. MusicRL is a pretrained autoregressive MusicLM (Agostinelli et al., 2023) model of discrete audio tokens finetuned with reinforcement learning to maximise sequence-level rewards. We design reward functions related specifically to text-adherence and audio quality with the help of selected raters, and use those to finetune MusicLM into MusicRL-R. We deploy MusicLM to users and collect a substantial dataset comprising 300,000 pairwise preferences. Using Reinforcement Learning from Human Feedback (RLHF), we train MusicRL-U, the first text-to-music model that incorporates human feedback at scale. Human evaluations show that both MusicRL-R and MusicRL-U are preferred to the baseline. Ultimately, MusicRL-RU combines the two approaches and results in the best model according to human raters. Ablation studies shed light on the musical attributes influencing human preferences, indicating that text adherence and quality account for only part of them. This underscores the prevalence of subjectivity in musical appreciation and calls for further involvement of human listeners in the finetuning of music generation models.
Submitted 6 February, 2024;
originally announced February 2024.
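The MusicRL abstract mentions learning from pairwise preferences but does not say how a reward model is fitted to them. A minimal sketch of one common choice, the Bradley-Terry objective, is given below; the choice of objective and the function names are assumptions for illustration, not details taken from the paper.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that clip A is preferred over clip B.

    The rewards are scalar scores from a hypothetical reward model.
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of one observed preference (assumed objective)."""
    return -math.log(preference_probability(reward_preferred, reward_rejected))


if __name__ == "__main__":
    # The loss shrinks as the preferred clip is scored further above the rejected one.
    for margin in (0.0, 1.0, 3.0):
        print(margin, pairwise_loss(margin, 0.0))
```

Minimising this loss over a set of preference pairs pushes the scorer to rank preferred clips higher; a sequence-level reward of this kind could then be maximised with reinforcement learning, as the abstract describes.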
-
AudioPaLM: A Large Language Model That Can Speak and Listen
Authors:
Paul K. Rubenstein,
Chulayuth Asawaroengchai,
Duc Dung Nguyen,
Ankur Bapna,
Zalán Borsos,
Félix de Chaumont Quitry,
Peter Chen,
Dalia El Badawy,
Wei Han,
Eugene Kharitonov,
Hannah Muckenhirn,
Dirk Padfield,
James Qin,
Danny Rozenberg,
Tara Sainath,
Johan Schalkwyk,
Matt Sharifi,
Michelle Tadmor Ramanovich,
Marco Tagliasacchi,
Alexandru Tudor,
Mihajlo Velimirović,
Damien Vincent,
Jiahui Yu,
Yongqiang Wang,
Vicky Zayats,
et al. (5 additional authors not shown)
Abstract:
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
Submitted 22 June, 2023;
originally announced June 2023.
-
SoundStorm: Efficient Parallel Audio Generation
Authors:
Zalán Borsos,
Matt Sharifi,
Damien Vincent,
Eugene Kharitonov,
Neil Zeghidour,
Marco Tagliasacchi
Abstract:
We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.
Submitted 16 May, 2023;
originally announced May 2023.
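The SoundStorm abstract names confidence-based parallel decoding without spelling out the procedure. The toy sketch below shows the general idea under stated assumptions: all positions start masked, every round predicts all masked positions at once, and only the most confident predictions are committed. The cosine schedule, the `predict` callback, and `toy_predict` are hypothetical illustrations, not the paper's implementation.

```python
import math
from typing import Callable, List, Optional, Tuple

Prediction = Tuple[int, float]  # (token id, confidence score)


def parallel_decode(
    length: int,
    predict: Callable[[List[Optional[int]]], List[Prediction]],
    num_rounds: int,
) -> List[int]:
    """Fill `length` positions in at most `num_rounds` parallel rounds."""
    tokens: List[Optional[int]] = [None] * length  # None marks a masked position
    for round_index in range(1, num_rounds + 1):
        masked = [i for i, token in enumerate(tokens) if token is None]
        if not masked:
            break
        predictions = predict(tokens)  # one prediction per position
        if round_index == num_rounds:
            num_to_commit = len(masked)  # last round commits everything left
        else:
            # Assumed cosine schedule: how many positions stay masked afterwards.
            stay_masked = math.floor(
                length * math.cos(math.pi / 2 * round_index / num_rounds)
            )
            num_to_commit = max(1, len(masked) - stay_masked)
        # Commit the most confident predictions among the masked positions.
        ranked = sorted(masked, key=lambda i: predictions[i][1], reverse=True)
        for i in ranked[:num_to_commit]:
            tokens[i] = predictions[i][0]
    return [token for token in tokens if token is not None]


def toy_predict(tokens: List[Optional[int]]) -> List[Prediction]:
    """Hypothetical stand-in for a model: earlier positions are more confident."""
    return [(2 * i, 1.0 / (1 + i)) for i in range(len(tokens))]


if __name__ == "__main__":
    print(parallel_decode(8, toy_predict, num_rounds=4))
```

Because the number of rounds is fixed and much smaller than the sequence length, the number of model calls no longer grows with the number of tokens, which is the intuition behind the speed-up over autoregressive decoding.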
-
Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision
Authors:
Eugene Kharitonov,
Damien Vincent,
Zalán Borsos,
Raphaël Marinier,
Sertan Girgin,
Olivier Pietquin,
Matt Sharifi,
Marco Tagliasacchi,
Neil Zeghidour
Abstract:
We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables training of the "speaking" module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the "reading" component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker-id labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in terms of naturalness and acoustic quality, as measured in subjective tests.
Submitted 7 February, 2023;
originally announced February 2023.
-
AudioLM: a Language Modeling Approach to Audio Generation
Authors:
Zalán Borsos,
Raphaël Marinier,
Damien Vincent,
Eugene Kharitonov,
Olivier Pietquin,
Matt Sharifi,
Dominik Roblek,
Olivier Teboul,
David Grangier,
Marco Tagliasacchi,
Neil Zeghidour
Abstract:
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
Submitted 25 July, 2023; v1 submitted 7 September, 2022;
originally announced September 2022.