SLM: Bridge the thin gap between speech and text foundation models
Authors:
Mingqiu Wang,
Wei Han,
Izhak Shafran,
Zelin Wu,
Chung-Cheng Chiu,
Yuan Cao,
Yongqiang Wang,
Nanxin Chen,
Yu Zhang,
Hagen Soltau,
Paul Rubenstein,
Lukas Zilka,
Dian Yu,
Zhong Meng,
Golan Pundak,
Nikhil Siddhartha,
Johan Schalkwyk,
Yonghui Wu
Abstract:
We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally preserves their capabilities, and only trains a simple adapter with just 1\% (156M) of the foundation models' parameters. This adaptation not only leads SLM to achiev…
▽ More
We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally preserves their capabilities, and only trains a simple adapter with just 1\% (156M) of the foundation models' parameters. This adaptation not only leads SLM to achieve strong performance on conventional tasks such as speech recognition (ASR) and speech translation (AST), but also introduces the novel capability of zero-shot instruction-following for more diverse tasks: given a speech input and a text instruction, SLM is able to perform unseen generation tasks including contextual biasing ASR using real-time context, dialog generation, speech continuation, and question answering, etc. Our approach demonstrates that the representational gap between pretrained speech and language models might be narrower than one would expect, and can be bridged by a simple adaptation mechanism. As a result, SLM is not only efficient to train, but also inherits strong capabilities already acquired in foundation models of different modalities.
△ Less
Submitted 29 September, 2023;
originally announced October 2023.
AudioPaLM: A Large Language Model That Can Speak and Listen
Authors:
Paul K. Rubenstein,
Chulayuth Asawaroengchai,
Duc Dung Nguyen,
Ankur Bapna,
Zalán Borsos,
Félix de Chaumont Quitry,
Peter Chen,
Dalia El Badawy,
Wei Han,
Eugene Kharitonov,
Hannah Muckenhirn,
Dirk Padfield,
James Qin,
Danny Rozenberg,
Tara Sainath,
Johan Schalkwyk,
Matt Sharifi,
Michelle Tadmor Ramanovich,
Marco Tagliasacchi,
Alexandru Tudor,
Mihajlo Velimirović,
Damien Vincent,
Jiahui Yu,
Yongqiang Wang,
Vicky Zayats
, et al. (5 additional authors not shown)
Abstract:
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the…
▽ More
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
△ Less
Submitted 22 June, 2023;
originally announced June 2023.