Towards audio language modeling - an overview

Haibin Wu$^{1}$, Xuanjun Chen$^{1*}$, Yi-Cheng Lin$^{1*}$, Kai-wei Chang$^{1}$, Ho-Lam Chung$^{1}$, Alexander H. Liu$^{2}$, Hung-yi Lee$^{1}$

$^{*}$Equal second contribution. $^{1}$National Taiwan University. $^{2}$Massachusetts Institute of Technology.

Abstract

Neural audio codecs were initially introduced to compress audio data into compact codes to reduce transmission latency. Researchers recently discovered the potential of codecs as suitable tokenizers for converting continuous audio into discrete codes, which can be employed to develop audio language models (LMs). Numerous high-performance neural audio codecs and codec-based LMs have been developed. This paper aims to provide a thorough and systematic overview of neural audio codec models and codec-based LMs.

Index Terms:

Neural codec, codec-based language model

Figure 1: Timeline of current neural codec models and codec-based language models.

I Introduction

Neural audio codec models were first introduced to compress audio for efficient data transmission. The encoder converts the audio into codec codes, which are then transmitted. The receiver then uses the codec decoder to reconstruct the audio using the received codes.

Language modeling has proven to be highly successful in the field of Natural Language Processing (NLP). Audio data encompasses not only textual content but also rich information about speaker timbre, emotion, and general audio, offering deeper possibilities for language model applications. Researchers, especially those in large companies with significant computational resources, have recently leveraged neural codecs [1, 2, 3, 4, 5, 6, 7, 8] as tokenizers for converting continuous audio into discrete codes, which can be employed to develop audio language models (LMs) [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. The current codec-based language models and codec models are summarized in Figure 1. These findings promptly garnered the community’s attention, sparking a fervor for developing codecs tailored to audio language modeling. Numerous high-performance neural audio codec models and audio LMs have been developed.

An ideal codec should maintain content while preserving paralinguistic and speaker-related information. Similarly, a universal audio language model should be able to generalize across various audio types, such as speech, music, and general audio, covering a wide range of applications. The arms race in developing codecs and audio LMs is still ongoing.

Despite the significant advancements in codecs and audio language models over the past three years, as shown in Figure 1, there has yet to be a comprehensive review that compares them and provides inspiration to the community. In this study, we aim to fill this research gap by thoroughly reviewing and comparing existing neural codec models and codec-based language models. First, we conduct an in-depth analysis of six representative open-source neural codec models, covering their training methodologies, implementation settings, and training data. Second, we expand our analysis to eleven diverse codec-based language models, examining how they utilize the codecs and the tasks to which they can be applied. Through this comprehensive review, we aim to offer the community insights into the diverse methodologies and potential directions in the field of neural codecs and codec-based language modeling.

II Comprehensive comparison for neural audio codec models

Codec models aim to compress and decompress speech signals efficiently. Traditional codecs are developed based on psycho-acoustics and speech synthesis [21, 22]. Recently, neural codec models have proven highly effective for compression and signal reconstruction, outperforming traditional codecs. Given the broad spectrum of codec models within the research community, each trained with its own configurations and training techniques, there is a clear need for a thorough examination of the training methodologies, implementation settings, and training data employed across these codec models. The six codec models have distinct training details, resulting in a collection of fifteen different codec models, as summarized in Table I.

II-A Brief method overview for codecs

SoundStream [2] stands as one of the pioneering implementations of neural codec models, embodying a classic neural codec architecture comprising encoder, quantizer, and decoder modules. It utilizes the streaming SEANets [23] as its encoder and decoder. The quantizer incorporates a speech enhancement system with a Residual Vector Quantization (RVQ) [24, 2] bottleneck to obtain parallel token streams. During training, the model parameters are optimized using a combination of reconstruction and adversarial loss. SoundStorm [3] is an improved version of SoundStream to achieve both efficiency and high-quality audio generation. It accomplishes this by employing an architecture specifically tailored to the hierarchical structure of audio tokens. Moreover, it pioneers a parallel, non-autoregressive decoding scheme, which relies on confidence-based strategies for residual vector-quantized token sequences.
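To make the RVQ bottleneck concrete, the following is a minimal NumPy sketch of residual vector quantization, not the SoundStream implementation. The function names, the random codebooks, and the toy dimensions are assumptions chosen for illustration; real codecs learn the codebooks jointly with the encoder and decoder.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize vectors x (N, D) with a list of codebooks, each of shape (K, D).

    Returns one code index per vector per stage, shape (n_stages, N).
    """
    residual = x.copy()
    codes = []
    for cb in codebooks:
        # Squared Euclidean distance from every residual to every codeword.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        # The next stage quantizes what this stage failed to capture.
        residual = residual - cb[idx]
    return np.stack(codes)

def rvq_decode(codes, codebooks):
    """Sum the selected codewords of every stage to rebuild the vectors."""
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

# Toy example: 16 frames of dimension 4, three stages of 8 codewords each.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
codebooks = [rng.normal(size=(8, 4)) * 0.5 ** i for i in range(3)]
codes = rvq_encode(x, codebooks)      # one token stream per stage
x_hat = rvq_decode(codes, codebooks)  # approximate reconstruction
```

Each row of `codes` is one of the parallel token streams mentioned above: the first row carries the coarsest approximation and later rows refine it.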

Encodec [1] builds upon a framework similar to SoundStream. Nonetheless, it further augments its capabilities by integrating supplementary LSTM [25] layers and harnessing a Transformer-based language model [26] to model the RVQ codes, thereby amplifying its sequence modeling performance. Then, there is a stream of work aimed at making codec models more general and powerful. AudioDec [4] represents an enhanced version of Encodec, implementing a group convolution mechanism to facilitate the real-time operation of the streamable network while also harnessing the capabilities of HiFi-GAN [27] to effectively generate high-fidelity audio at a high sampling rate of 48 kHz.

The AcademiCodec model [5] introduces a novel technique known as group-residual vector quantization, which employs multiple parallel RVQ groups. This technique is specifically tailored for generation tasks. It aims to enhance the reconstruction performance while using a limited number of codebooks, consequently achieving an impressively low bit rate, measured in bits per second (BPS). A low BPS is of utmost significance because it addresses the challenge of lengthy speech token sequences in speech language modeling by reducing sequence lengths.
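The idea of running several RVQ groups in parallel can be sketched as follows. This is an illustrative NumPy sketch, not the AcademiCodec code: it assumes the latent channels are split evenly into groups and that each group has its own independent RVQ stack. The helper names and the 2 groups x 2 stages configuration are hypothetical.

```python
import numpy as np

def rvq(x, codebooks):
    """Residual VQ: return (codes, reconstruction) for x of shape (N, D)."""
    residual, recon, codes = x.copy(), np.zeros_like(x), []
    for cb in codebooks:
        idx = ((residual[:, None] - cb[None]) ** 2).sum(-1).argmin(1)
        codes.append(idx)
        recon += cb[idx]     # accumulate the chosen codewords
        residual -= cb[idx]  # pass the remaining error to the next stage
    return np.stack(codes), recon

def group_rvq(x, grouped_codebooks):
    """Split channels into equal groups and run an independent RVQ per group."""
    chunks = np.split(x, len(grouped_codebooks), axis=1)
    results = [rvq(c, cbs) for c, cbs in zip(chunks, grouped_codebooks)]
    codes = np.concatenate([r[0] for r in results])          # (groups * stages, N)
    recon = np.concatenate([r[1] for r in results], axis=1)  # (N, D)
    return codes, recon

# Toy example: 10 frames, 8 channels, 2 groups with 2 RVQ stages each.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))
grouped_codebooks = [[rng.normal(size=(16, 4)) for _ in range(2)] for _ in range(2)]
codes, recon = group_rvq(x, grouped_codebooks)
```

In this toy configuration each frame is described by four code indices in total, while every individual RVQ stack only has to model half of the channels.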

SpeechTokenizer [7] is a unified speech tokenizer designed for speech language models. It implements an Encoder-Decoder architecture enhanced with RVQ. By integrating both semantic and acoustic tokens, SpeechTokenizer hierarchically separates various aspects of speech information across different RVQ layers. Specifically, SpeechTokenizer is designed to regularize the first RVQ layer to highlight semantic information by learning the Hubert tokens [28]. Using such techniques can enhance the disentanglement of information across different RVQ layers.

Descript-audio-codec (DAC) [8], a universal neural codec model, distinguishes itself through its exceptional ability to maintain high-fidelity audio quality across a wide spectrum of data types, encompassing general audio, music, and speech. It accomplishes this feature by employing a number of training techniques, such as periodic activation functions [29], enhanced residual vector quantization using factorized and L2-normalized codes, random quantizer dropout to preserve audio reconstruction quality, as well as refining adversarial and reconstruction loss during the training process. The authors highlight the crucial importance of the periodic activation function among the employed techniques.
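For reference, the periodic (snake) activation of [29] has the closed form $\mathrm{snake}_{\alpha}(x) = x + \frac{1}{\alpha}\sin^{2}(\alpha x)$. The snippet below is a minimal NumPy sketch of that formula only. Treating `alpha` as a fixed scalar is a simplification made here for illustration; in a trained model it would be a learnable parameter.

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake activation: identity plus a periodic term with frequency alpha."""
    return x + np.sin(alpha * x) ** 2 / alpha

x = np.linspace(-3.0, 3.0, 7)
y = snake(x, alpha=2.0)  # larger alpha gives faster oscillation around y = x
```

Because the derivative is $1 + \sin(2\alpha x) \geq 0$, the function stays monotonically non-decreasing while still injecting a periodic component.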

Unlike most models focusing on the time domain, FunCodec [6] proposes a frequency-domain codec. The authors claim they can achieve comparable performance with fewer parameters and lower computation complexity. Meanwhile, it also finds that incorporating semantic information in the codec tokens improves speech quality at low bit rates.

II-B Comparison from methodology angles

We compare several techniques proposed by these codecs in Table II. The abbreviations “A-F” represent different codec models; please refer to Table I for the corresponding full model names. The design of discriminators constitutes a pivotal element within codec models. Encodec initially introduces the Multi-scale-STFT Discriminator (MS-STFTD). In contrast to the multi-scale discriminator (MSD) proposed in MelGAN [24], which captures long-term dependencies, the multi-period discriminator (MPD) proposed in HiFi-GAN [30] exhibits a capacity to discern more nuanced periodic details. Consequently, AudioDec replaces the conventionally employed STFTD with a HiFi-GAN-based MPD, observing an enhancement in audio quality within their model. AcademiCodec integrates prior research efforts by incorporating the MS-STFTD from Encodec and both the HiFi-GAN-based MPD and MSD. Both SpeechTokenizer and FunCodec adopt the same discriminators as AcademiCodec, with FunCodec offering a unified interface adaptable to any combination of these three discriminator types. DAC finds that employing MSD and MPD alone generates audio with blurriness and artifacts. To address this, its authors propose a multi-scale, multi-band STFT discriminator (MS-MB-STFTD) to improve phase modeling and mitigate aliasing artifacts.

SpeechTokenizer utilizes semantic tokens from Hubert L9 as a teacher for the RVQ process. This guidance enables the disentanglement of content information into the first layer of the tokenizer, while paralinguistic information is retained in subsequent layers. FunCodec seeks to integrate semantic information by combining, adding, or residualizing the audio codec with semantic tokens. The study reveals that including semantic tokens enhances audio quality, particularly with the residual inclusion method. Additionally, SpeechTokenizer and FunCodec utilize K-means to cluster samples in the first mini-batch for initializing the VQ codebook, leading to improved code utilization. DAC follows the approach of BigVGAN [31], employing snake activation [29] for trainable control over the frequency of periodic signals. AcademiCodec employs multiple RVQ codebooks (multiple residual groups) to represent intermediate features. They demonstrate that using multiple residual groups achieves good reconstruction performance while employing only a few codebooks. Encodec trains an additional small transformer model for entropy coding over the quantized units, which reduces bandwidth and accelerates encoding and decoding.
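The K-means codebook initialization mentioned above can be sketched as follows. This is a plain Lloyd's-algorithm sketch in NumPy under stated assumptions: the function name, the iteration count, and the synthetic "first mini-batch" are hypothetical, and the actual toolkits may differ in details such as how empty clusters are handled.

```python
import numpy as np

def kmeans_init_codebook(batch, n_codes, n_iters=10, seed=0):
    """Initialize a VQ codebook by clustering the first mini-batch (N, D)."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen distinct samples of the batch.
    centroids = batch[rng.choice(len(batch), size=n_codes, replace=False)].copy()
    for _ in range(n_iters):
        dists = ((batch[:, None] - centroids[None]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for k in range(n_codes):
            members = batch[assign == k]
            if len(members) > 0:  # keep the old centroid for empty clusters
                centroids[k] = members.mean(0)
    return centroids

# Synthetic mini-batch of encoder outputs with two well-separated modes.
rng = np.random.default_rng(0)
batch = np.concatenate([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(10.0, 0.1, size=(20, 2)),
])
codebook = kmeans_init_codebook(batch, n_codes=2)
```

Starting from centroids that already sit on the data means every codeword is likely to be selected early in training, which is the intuition behind the improved code utilization.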

TABLE I: Codec information comparison. "A-F" represents different neural codec models, where "A" is SpeechTokenizer [7], "B$\sim$" is AcademiCodec [5], "C" is AudioDec [4], "D$\sim$" is DAC [8], "E$\sim$" is EnCodec [1], and "F$\sim$" is FunCodec [6]. $n_{c}$ represents the codebook number, SR represents the sample rate in kHz, and BPS represents the bit rate in kilobits per second.

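	Codec information	Training data	$n_{c}$	SR (kHz)	BPS (kbps)
A	16k	Librispeech	8	16	4
B1	hifi_16k_320d	LibriTTS, VCTK, AISHELL	4	16	2
B2	hifi_16k_320d_large_uni	LibriTTS, VCTK, AISHELL	4	16	2
B3	hifi_24k_320d	LibriTTS, VCTK, AISHELL	4	24	3
C	24k_320d	Valentini	8	24	6.4
D1	16k	Common Voice, DAPS, VCTK, MUSDB, Jamendo, AudioSet	12	16	6
D2	24k	Common Voice, DAPS, VCTK, MUSDB, Jamendo, AudioSet	32	24	24
D3	44k	Common Voice, DAPS, VCTK, MUSDB, Jamendo, AudioSet	9	44.1	8
E1	24k_1.5bps	Common Voice, DAPS, Jamendo, AudioSet, FSD50K	2	24	1.5
E2	24k_3bps	Common Voice, DAPS, Jamendo, AudioSet, FSD50K	4	24	3
E3	24k_6bps	Common Voice, DAPS, Jamendo, AudioSet, FSD50K	8	24	6
E4	24k_12bps	Common Voice, DAPS, Jamendo, AudioSet, FSD50K	16	24	12
E5	24k_24bps	Common Voice, DAPS, Jamendo, AudioSet, FSD50K	32	24	24
F1	en_libritts_16k_gr1nq32ds320	Subset of LibriTTS	32	16	16
F2	en_libritts_16k_gr8nq32ds320	Subset of LibriTTS	32	16	16
F3	en_libritts_16k_nq32ds320	Subset of LibriTTS	32	16	16
F4	en_libritts_16k_nq32ds640	Subset of LibriTTS	32	16	8
F5	zh_en_16k_nq32ds320	25k hours collected data (en and zh-cn)	32	16	16
F6	zh_en_16k_nq32ds640	25k hours collected data (en and zh-cn)	32	16	8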

TABLE II: Comparison of codec implementation strategies. SEM indicates that the codec includes semantic tokens. Snake indicates that the codec employs snake activation. MRG indicates that the codec has multiple residual groups. Noisy indicates that the codec uses noisy data in training. LM indicates that the model includes language model training. KM indicates that the codec uses K-means to cluster samples to initialize the VQ codebook.

Codec	Discriminators	SEM	Snake	MRG	Noisy	LM	KM
A	MSD + MPD + MS-STFTD	✓	✗	✗	✗	✗	✓
B	MSD + MPD + MS-STFTD	✗	✗	✓	✗	✗	✗
C	MPD	✗	✗	✗	✗	✗	✗
D	MPD + MS-MB-STFTD	✗	✓	✗	✗	✗	✗
E	MS-STFTD	✗	✗	✗	✗	✓	✗
F	MSD + MPD + MS-STFTD	✓	✗	✗	✓	✗	✓

II-C Implementation details

We compare the codebook number, training data, sampling rate, and bit rate in Table I. From the training data perspective, SpeechTokenizer [7], AudioDec [4], and FunCodec [6] utilize only English speech datasets. AcademiCodec [5] incorporates bilingual speech datasets, including AISHELL for Chinese and LibriTTS and VCTK for English. Both DAC [8] and Encodec [1] encompass data from diverse modalities, including speech, music, and general audio, in their training data.
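The relationship between the columns of Table I can be checked with a short calculation. The sketch below assumes that every codebook holds 1024 entries (10 bits per code) and that the encoder stride can be read off the model names (for example `320d` or `ds320`); neither value is stated in the text, so both are assumptions, and the helper name is hypothetical.

```python
import math

def bitrate_kbps(sample_rate_hz, hop, n_codebooks, codebook_size=1024):
    """Bit rate of an RVQ codec: frames per second times bits per frame."""
    frames_per_second = sample_rate_hz / hop
    bits_per_frame = n_codebooks * math.log2(codebook_size)
    return frames_per_second * bits_per_frame / 1000

row_a = bitrate_kbps(16_000, 320, 8)    # SpeechTokenizer, row A
row_e3 = bitrate_kbps(24_000, 320, 8)   # EnCodec 24k_6bps, row E3
row_f4 = bitrate_kbps(16_000, 640, 32)  # FunCodec nq32ds640, row F4
```

Under these assumptions the three calls give 4, 6, and 8 kbps, in line with the BPS column for rows A, E3, and F4. The same formula shows why fewer codebooks or a larger stride shorten the token sequence an LM has to model.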

III Current codec-based speech language models

As shown in Figure 2, the process of neural codec-based audio language modeling begins by converting context information, such as text and MIDI, into context codes, while simultaneously encoding the audio into codec codes. These context and codec codes are then employed in the language modeling phase to generate the desired target codec code sequence. Subsequently, the target codec code sequence is passed to the codec decoder to produce the audio output. The entire pipeline embodies an audio-to-audio modeling approach.

III-A Overview for codec-based LMs

AudioLM [9] is the pioneering model in introducing codec codes for language modeling, utilizing a hierarchical approach that encompasses two distinct stages. The first stage generates semantic tokens using a self-supervised w2v-BERT model [32]. These tokens are then leveraged in the second stage as conditioning elements to create acoustic tokens using a SoundStream neural codec [2].

VALL-E [12], VALL-E X [13], and SpeechX [17], all originate from Microsoft and are neural codec language models trained to generate discrete codes derived from EnCodec [1], based on textual or acoustic inputs. VALL-E can generate high-quality personalized speech with only a 3-second enrollment recording from an unseen speaker. Furthermore, VALL-E X can produce high-quality speech in the target language with just a single speech utterance in the source language as a prompt. Additionally, SpeechX introduces a unified framework to address not only zero-shot TTS but also various types of speech transformation tasks, including speech enhancement and speech editing.

What sets VioLA [14], AudioPaLM [10], and LauraGPT [16] apart is their dual capability to generate both text and audio. VioLA tries to tackle the question “Is one decoder-only generative model all you need for speech recognition, synthesis, and translation?” by employing language modeling that integrates both text tokens and audio tokens (extracted by EnCodec [1]), along with the use of task IDs and language IDs. AudioPaLM constructs a unified vocabulary comprising both text and audio tokens. It is a decoder-only, autoregressive model capable of processing and generating both text and speech. Additionally, AudioPaLM is initialized from PaLM-2 [33], a text-only language model. AudioPaLM’s approach to audio tokenization resembles that of AudioLM. Moreover, AudioPaLM adopts and extends the SoundStream model to SoundStorm [3]. LauraGPT [16] is a versatile language model built on a decoder-only text-based language model, Qwen-2B [34]. LauraGPT can process both audio and text inputs, generating outputs in either modality. LauraGPT encodes input audio into continuous representations using a Conformer encoder and decodes output audio using FunCodec [6] discrete codes. Based on preliminary experimental results, the authors claim that this choice of audio features for inputs and outputs improves speech generation performance.
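One simple way to picture a unified text-and-audio vocabulary is to append the audio tokens after the text tokens by shifting their ids. The sketch below illustrates that idea only and is not taken from any of the models above; the vocabulary sizes and function names are hypothetical.

```python
TEXT_VOCAB_SIZE = 32_000  # hypothetical text vocabulary size
AUDIO_VOCAB_SIZE = 1_024  # hypothetical number of distinct audio tokens

def audio_to_unified(audio_ids):
    """Map codec token ids into the id range that follows the text vocabulary."""
    for a in audio_ids:
        if not 0 <= a < AUDIO_VOCAB_SIZE:
            raise ValueError(f"audio id {a} out of range")
    return [TEXT_VOCAB_SIZE + a for a in audio_ids]

def describe(unified_ids):
    """Recover (modality, local id) pairs from unified ids."""
    return [
        ("text", i) if i < TEXT_VOCAB_SIZE else ("audio", i - TEXT_VOCAB_SIZE)
        for i in unified_ids
    ]

# A mixed sequence: two text tokens followed by three audio tokens.
sequence = [17, 256] + audio_to_unified([0, 5, 1023])
```

With a single id space, one decoder-only model can consume and emit either modality, and a text-only checkpoint only needs its embedding table extended by `AUDIO_VOCAB_SIZE` rows.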

UniAudio [15] utilizes language modeling to generate a wide range of audio types, including speech, sounds, music, and singing, using textual or acoustic tokens as inputs. UniAudio stands out for its ability to enhance autoregressive prediction speed by introducing a multi-scale Transformer model [35], which employs a large global transformer to predict the first-layer codec codes and a small local transformer to predict the codec codes for the subsequent codec layers. The codec model in UniAudio is revised from EnCodec.

Additionally, there are other codec-based language models designed for sound modeling. AudioGen [20] trains a SoundStream model to obtain audio tokens and subsequently trains a language model that uses textual features as conditions for generating audio tokens. MusicLM [11] follows a training strategy similar to AudioLM but extends its scope to encompass music features. It approaches the task of conditional music generation through a hierarchical sequence-to-sequence modeling approach. Initially, it utilizes music tokens from Mulan [36] to generate semantic tokens from the w2v-BERT model. Subsequently, it employs both music tokens and semantic tokens to generate acoustic features through SoundStream. MusicGen [18] is a music language model designed to work with EnCodec discrete tokens. It accepts textual descriptions or melodic features as input conditions to generate tokens, which can be reconstructed into high-fidelity music.

TABLE III: Codec-based language models comparison. "T" means text, "AUD" means audio, "P" means phoneme, and "M" means MIDI.

CLM	Task	Input	Output	Codec
AudioLM [9]	SC, PC	AUD	AUD	SoundStream [2]
AudioGen [20]	AC	AUD, T	AUD	SoundStream [2]
VALL-E [12]	TTS	AUD, T	AUD	EnCodec [1]
MusicLM [11]	MG	AUD, T	AUD	SoundStream [2]
VALL-E X [13]	TTS, S2ST	AUD, T	AUD	EnCodec [1]
VioLA [14]	ASR, S2TT, TTS, MT	AUD, T	AUD, T	EnCodec [1]
MusicGen [18]	MG, SG	AUD	AUD	EnCodec [1]
AudioPaLM [10]	ASR, S2TT, TTS, MT	AUD, T	AUD, T	SoundStorm [3]
SpeechX [17]	SE, SR, TSE, TTS, SPED	AUD, T	AUD	EnCodec [1]
LauraGPT [16]	ASR, S2TT, TTS, MT, SE, AAC, SER, SLU	AUD, T	AUD, T	FunCodec [6]
UniAudio [15]	TTS, VC, SE, TSE, SVS, TTSO, TTM, AUED, SD, ITTS, SPED	P, M, AUD, T	AUD	EnCodec [1]

Another branch of speech language modeling aims to utilize discrete units obtained by quantizing self-supervised speech representations. While these discrete units contain rich acoustic and linguistic information [37], they lack speaker and paralinguistic information [38]. This research direction focuses on modeling the semantics of speech, with the optional use of encoders to learn about speaker characteristics and prosody. A pioneering work is speech-resynthesis [38], which utilizes these discrete units in conjunction with prosody and speaker encoders to encode speech into low-bitrate codes. These codes can then be resynthesized into a speech signal with a decoder to achieve low-bitrate transmission. Additionally, these discrete units can be regarded as “pseudo-text,” serving as a foundation for training textless speech language models. Notable examples include GSLM [39], pGSLM [40], dGSLM [41], and TWIST [42]. Through the pre-training task of next-token prediction, these speech LMs perform spoken language modeling and can carry out speech continuation. In the field of speech translation, recent advancements have been made possible through these discrete units. [43] pre-trained a Unit mBART combined with a wav2vec 2.0 [44] encoder to directly predict the translated discrete units. UnitY [45] further incorporates the text modality to enhance speech translation. The Seamless models [46, 47] integrate the UnitY framework to perform expressive and streaming speech-to-text and speech-to-speech translation. With the development of these powerful speech LMs, researchers have begun to explore the use of prompting on speech LMs for various speech processing tasks, including prompt tuning [48, 49, 50], in-context learning [51], and instruction tuning [52, 53].

III-B Comparison for Codec-based audio language models

In Table III, we compare the inputs, outputs, and downstream tasks of different codec-based language models. The downstream tasks conducted by these models are: Speech Continuation (SC), Piano Continuation (PC), Audio Continuation (AC), Text-to-Speech (TTS), Music Generation (MG), Stereophonic Generation (SG), Speech to Speech Translation (S2ST), Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU), Automated Audio Captioning (AAC), Speech Emotion Recognition (SER), Speech to Text Translation (S2TT), Machine Translation (MT), Speech Enhancement (SE), Speech Removal (SR), Target Speaker Extraction (TSE), Speech Editing (SPED), Voice Conversion (VC), Singing Voice Synthesis (SVS), Text-to-Sound (TTSO), Text-to-Music (TTM), Audio Editing (AUED), Speech Dereverb (SD), and Instructed TTS (ITTS). Finally, we show the codec models adopted by different LMs.

IV Conclusion

This paper fills a research gap by reviewing neural codec models and the LMs built upon them. We hope the comprehensive review and comparisons can inspire future research that advances the development of neural codec models and codec-based LMs.

References

[1] Alexandre Défossez et al., “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
[2] Neil Zeghidour et al., “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.
[3] Zalán Borsos et al., “Soundstorm: Efficient parallel audio generation,” arXiv preprint arXiv:2305.09636, 2023.
[4] Yi-Chiao Wu et al., “Audiodec: An open-source streaming high-fidelity neural audio codec,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[5] Dongchao Yang et al., “Hifi-codec: Group-residual vector quantization for high fidelity audio codec,” arXiv preprint arXiv:2305.02765, 2023.
[6] Zhihao Du, Shiliang Zhang, Kai Hu, and Siqi Zheng, “Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec,” arXiv preprint arXiv:2309.07405, 2023.
[7] Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu, “Speechtokenizer: Unified speech tokenizer for speech large language models,” arXiv preprint arXiv:2308.16692, 2023.
[8] Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar, “High-fidelity audio compression with improved rvqgan,” arXiv preprint arXiv:2306.06546, 2023.
[9] Zalán Borsos et al., “Audiolm: a language modeling approach to audio generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[10] Paul K Rubenstein et al., “Audiopalm: A large language model that can speak and listen,” arXiv preprint arXiv:2306.12925, 2023.
[11] Andrea Agostinelli et al., “Musiclm: Generating music from text,” arXiv preprint arXiv:2301.11325, 2023.
[12] Chengyi Wang et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.
[13] Ziqiang Zhang et al., “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,” arXiv preprint arXiv:2303.03926, 2023.
[14] Tianrui Wang et al., “Viola: Unified codec language models for speech recognition, synthesis, and translation,” arXiv preprint arXiv:2305.16107, 2023.
[15] Dongchao Yang et al., “Uniaudio: An audio foundation model toward universal audio generation,” arXiv preprint arXiv:2310.00704, 2023.
[16] Qian Chen et al., “Lauragpt: Listen, attend, understand, and regenerate audio with gpt,” arXiv preprint arXiv:2310.04673, 2023.
[17] Xiaofei Wang et al., “Speechx: Neural codec language model as a versatile speech transformer,” arXiv preprint arXiv:2308.06873, 2023.
[18] Jade Copet et al., “Simple and controllable music generation,” arXiv preprint arXiv:2306.05284, 2023.
[19] Gael Le Lan et al., “Stack-and-delay: a new codebook pattern for music generation,” arXiv preprint arXiv:2309.08804, 2023.
[20] Felix Kreuk et al., “Audiogen: Textually guided audio generation,” arXiv preprint arXiv:2209.15352, 2022.
[21] Jean-Marc Valin et al., “Rfc 6716: Definition of the opus audio codec,” 2012.
[22] Martin Dietz et al., “Overview of the evs codec architecture,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5698–5702.
[23] Marco Tagliasacchi et al., “Seanet: A multi-modal speech enhancement network,” arXiv preprint arXiv:2009.02095, 2020.
[24] Kundan Kumar et al., “Melgan: Generative adversarial networks for conditional waveform synthesis,” Advances in neural information processing systems, vol. 32, 2019.
[25] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[26] Ashish Vaswani et al., “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[27] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
[28] Wei-Ning Hsu et al., “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[29] Liu Ziyin, Tilman Hartwig, and Masahito Ueda, “Neural networks fail to learn periodic functions and how to fix it,” Advances in Neural Information Processing Systems, vol. 33, pp. 1583–1594, 2020.
[30] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” 2020.
[31] Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon, “Bigvgan: A universal neural vocoder with large-scale training,” arXiv preprint arXiv:2206.04658, 2022.
[32] Yu-An Chung et al., “W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 244–250.
[33] Rohan Anil et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403, 2023.
[34] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
[35] Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis, “Megabyte: Predicting million-byte sequences with multiscale transformers,” arXiv preprint arXiv:2305.07185, 2023.
[36] Qingqing Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, and Daniel PW Ellis, “Mulan: A joint embedding of music audio and natural language,” arXiv preprint arXiv:2208.12415, 2022.
[37] Dan Wells, Hao Tang, and Korin Richmond, “Phonetic Analysis of Self-supervised Representations of English Speech,” in Proc. Interspeech 2022, 2022, pp. 3583–3587.
[38] Adam Polyak et al., “Speech resynthesis from discrete disentangled self-supervised representations,” in Interspeech, 2021, pp. 3615–3619.
[39] Kushal Lakhotia et al., “On generative spoken language modeling from raw audio,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1336–1354, 2021.
[40] Eugene Kharitonov et al., “Text-free prosody-aware generative spoken language modeling,” arXiv preprint arXiv:2109.03264, 2021.
[41] Tu Anh Nguyen et al., “Generative spoken dialogue language modeling,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 250–266, 2023.
[42] Michael Hassid et al., “Textually pretrained speech language models,” arXiv preprint arXiv:2305.13009, 2023.
[43] Sravya Popuri et al., “Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation,” in Proc. Interspeech 2022, 2022, pp. 5195–5199.
[44] Alexei Baevski et al., “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12449–12460, 2020.
[45] Hirofumi Inaguma et al., “Unity: Two-pass direct speech-to-speech translation with discrete units,” arXiv preprint arXiv:2212.08055, 2022.
[46] Loïc Barrault et al., “Seamlessm4t-massively multilingual & multimodal machine translation,” arXiv preprint arXiv:2308.11596, 2023.
[47] Loïc Barrault et al., “Seamless: Multilingual expressive and streaming speech translation,” arXiv preprint arXiv:2312.05187, 2023.
[48] Kai-Wei Chang et al., “An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks,” in Proc. Interspeech 2022, 2022, pp. 5005–5009.
[49] Kai-Wei Chang et al., “Speechprompt v2: Prompt tuning for speech classification tasks,” arXiv preprint arXiv:2303.00733, 2023.
[50] Haibin Wu, Kai-Wei Chang, Yuan-Kuei Wu, and Hung-yi Lee, “Speechgen: Unlocking the generative power of speech language models with prompts,” arXiv preprint arXiv:2306.02207, 2023.
[51] Ming-Hao Hsu et al., “An exploration of in-context learning for speech language model,” arXiv preprint arXiv:2310.12477, 2023.
[52] Chun-Yi Kuan, Chen-An Li, et al., “Towards general-purpose text-instruction-guided voice conversion,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.
[53] Chien-yu Huang, Ke-Han Lu, et al., “Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,” arXiv preprint arXiv:2309.09510, 2023.