Vasileios Moschopoulos (2), Thanasis Kotsiopoulos (2), Pablo Peso Parada (1), Konstantinos Nikiforidis (2), Alexandros Stergiadis (3), Gerasimos Papakostas (2), Md Asif Jalal (1), Jisi Zhang (1), Anastasios Drosou (2), Karthikeyan Saravanan (1)

Exploring compressibility of transformer based text-to-music (TTM) models

Abstract

State-of-the-art Text-To-Music (TTM) generative AI models are large and require desktop- or server-class compute, making them infeasible for deployment on mobile phones. This paper presents an analysis of the trade-offs between model compression and generation performance of TTM models. We study compression through knowledge distillation and specific modifications that make it applicable to the various components of the TTM model (the encoder, the generative model and the decoder). Leveraging these methods, we create TinyTTM (89.2M params), which achieves a FAD of 3.66 and a KL of 1.32 on the MusicBench dataset. These scores are better than those of MusicGen-Small (557.6M params), but do not surpass those of MusicGen-Small fine-tuned on MusicBench.

keywords:
Text-To-Music, Distillation, Compression.

1 Introduction

Recent developments in generative AI have facilitated the emergence of various novel applications, including text-to-music (TTM) systems to generate music content from textual prompts. These systems allow users to efficiently craft music compositions that fulfill their specific requirements, reducing the necessity for expertise in music composition. Recent works on this topic, such as MusicGen [1] or MusicLDM [2], have demonstrated the capability of generating high-quality music samples, highlighting the potential of TTM.

TTM models consist of three primary components: a text encoder, a generative model and a decoder. The text encoder is responsible for translating the input text into an embedding representation. Different text encoders are employed in the literature, such as contrastive language-audio pretrained models based on MuLan [3] or CLAP [4], employed in MusicLM [5], Stable Audio [6], AudioLDM [7] and AudioLDM2 [8], or language-model-based encoders [9, 10], integrated in MAGNeT [11], MusicGen [1], Mustango [12] and AudioLDM2 [8]. There are two primary approaches used for the generative model: autoregressive/non-autoregressive transformers, as in [5, 1, 11], where the transformer produces a sequence of discrete tokens conditioned on the encoder embedding via the cross-attention mechanism; and diffusion-based models [7, 8, 2, 6, 12], which learn to denoise a randomly generated latent space conditioned on the encoder embedding. The decoder accepts the generative model output and transforms it into a time-domain signal via audio codecs [5, 1, 11, 7, 8, 2, 12, 6].

Contemporary TTM models [1, 7] are large, with hundreds of millions of parameters. Even the smaller variants of [1] and [7] have 557.6M parameters (the authors mention 300M parameters, but this refers to the autoregressive transformer only) and 428M parameters, respectively. It is essential to minimise the size of such models in order to deploy them on resource-constrained devices. In this work we propose and analyse different methods to reduce the size of each component in TTM. A popular approach to reducing model size is knowledge distillation, where the “knowledge” of the original model, often referred to as the “teacher”, is distilled into a smaller model of the target size, referred to as the “student”. The transfer is typically driven by a Kullback-Leibler divergence loss [13, 14], which measures the discrepancy between the teacher and student output distributions. Furthermore, the intermediate layer outputs of the teacher model can also be exploited [15, 16, 17, 18], providing the student model with access to deeper knowledge and encouraging finer alignment between the models’ internal representations. Another common strategy is weight transfer [19], where part of the teacher weight matrices is copied directly to the student model [20, 21, 22] to accelerate learning and improve final performance [16].

Figure 1 summarizes this work: we start from a pretrained MusicGen-Small model and, through the methods presented in this paper, reduce the model size from 557.6M parameters down to 89.2M parameters. This reduction is achieved by individually optimizing each of the three components of the TTM model through a method that combines distillation and fine-tuning techniques using the MusicBench [12] dataset.

Figure 1: MusicGen-Small vs. proposed TinyTTM.

Contributions of this work are:

  • The creation of a tiny TTM system containing 89.2M parameters which, for the first time to the best of our knowledge, demonstrates performance comparable to larger systems while being more suitable for deployment on resource-constrained devices.

  • A knowledge distillation technique that applies adaptive learning through stochastic loss sampling. At each distillation step, this method re-weights the different parts of the loss (teacher loss, ground-truth loss and intermediate-layer loss). This loss weight scheduling performs better than existing methods in which the weights given to each part of the loss are fixed: by adding randomness to how the losses are weighted, it improves representation learning. Two different distributions are explored to model the loss weight scheduling.

2 TinyTTM

This section describes the MusicGen [1] components and the methods proposed to effectively reduce the model size.

2.1 Encoder

The encoder is used to create the embedding vector for conditioning the generator. This work uses T5 (Text-to-Text Transformer) [9], following the structure proposed by MusicGen [1].

The T5 model, based on a Transformer encoder-decoder architecture, is offered in various sizes [23]. This paper focuses predominantly on T5-tiny (used in TinyTTM) and compares its performance with T5-base (used in MusicGen-Small). We evaluate the pretrained T5-tiny model (https://huggingface.co/google/t5-efficient-tiny) and fine-tune it on text containing music descriptions from MusicBench. To fine-tune the model we use the span-based MLM objective and cross-entropy loss [9]. During our investigation, we specifically explore the strategy of freezing the decoder layers during training, as these layers are not used during inference. Additionally, to further reduce the four-layer T5-tiny encoder, we explore dropping the last layers.

2.2 Generative Transformer Language Model (LM)

The Language Model (LM) in MusicGen [1] is trained using the cross-entropy loss between the predicted output logits and the ground-truth audio tokens generated by EnCodec. We define the student loss $L_S$ as:

$$L_S = -\frac{1}{K}\sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{c=1}^{C} y_{k,t,c}\log(p_{k,t,c}) \tag{1}$$

where $T$ is the number of temporal steps, $K$ the number of codebooks, $C$ the cardinality of each codebook, and $y_{k,t,c}$ and $p_{k,t,c}$ represent the ground truth and the predicted probability of the $c^{th}$ token at time $t$ for the $k^{th}$ codebook, respectively. The batch dimension is omitted for simplicity.

In this work, prior knowledge from a teacher model is leveraged to guide the student model. This is accomplished by using both the teacher LM outputs and the hidden representations within the transformer architecture. We define the teacher loss $L_T$ based on the KL divergence [19, 16]:

$$L_T = \frac{1}{K}\sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{c=1}^{C} q_{k,t,c}\log\left(\frac{q_{k,t,c}}{p_{k,t,c}}\right) \tag{2}$$

where $q_{k,t,c}$ and $p_{k,t,c}$ are the predicted probabilities of the $c^{th}$ token at time $t$ for the $k^{th}$ codebook according to the teacher and the student model, respectively. The teacher loss $L_T$ is the KL divergence averaged across all $K$ codebooks. Additionally, the teacher intermediate transformer layer outputs $\hat{y}_T$ are used as guidance to align the corresponding student outputs $\hat{y}_S$:

$$L_{MSE}(\hat{y}_T,\hat{y}_S) = \frac{1}{K_{tr}}\sum_{k=1}^{K_{tr}} |\hat{y}_{T,m(k)} - \hat{y}_{S,k}|^{2} \tag{3}$$

where $K_{tr}$ denotes the number of student transformer layers, and $|\cdot|^{2}$ signifies the squared Euclidean norm. The additional subscripts in $\hat{y}_{T,m(k)}$ and $\hat{y}_{S,k}$ represent the layer index, and $m(k)$ is a function that maps the student transformer layer index $k$ to the corresponding teacher layer, based on the layer matching strategy that is employed. The final loss is given by the weighted sum of the aforementioned loss terms:

$$L_{LM} = l_s \cdot L_S + l_t \cdot L_T + l_m \cdot L_{MSE} \tag{4}$$

where $l_s$, $l_t$ and $l_m$ are scaling factors applied to each loss term to bring their magnitudes into the same range.
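The loss terms above can be illustrated with a minimal pure-Python sketch. It assumes probabilities are given as nested lists indexed [codebook][time][token] and that each layer output is a flat vector; the function names and the layer-matching callable `m` are hypothetical and are not taken from the paper or from the MusicGen code base. The teacher term is computed as the KL divergence between the teacher and student distributions.

```python
import math

def student_loss(y, p):
    """Cross-entropy of Eq. (1); y and p are indexed [codebook][time][token]."""
    num_codebooks = len(y)
    total = sum(
        yc * math.log(pc)
        for yk, pk in zip(y, p)
        for yt, pt in zip(yk, pk)
        for yc, pc in zip(yt, pt)
        if yc > 0.0  # zero-probability targets contribute nothing
    )
    return -total / num_codebooks

def teacher_loss(q, p):
    """KL(q || p) summed over time and tokens, averaged over codebooks."""
    num_codebooks = len(q)
    total = sum(
        qc * math.log(qc / pc)
        for qk, pk in zip(q, p)
        for qt, pt in zip(qk, pk)
        for qc, pc in zip(qt, pt)
        if qc > 0.0
    )
    return total / num_codebooks

def hidden_mse(teacher_layers, student_layers, m):
    """Eq. (3): squared Euclidean distance between matched layer outputs.

    m is an assumed callable mapping a student layer index to a teacher one.
    """
    total = 0.0
    for k, y_s in enumerate(student_layers):
        y_t = teacher_layers[m(k)]
        total += sum((a - b) ** 2 for a, b in zip(y_t, y_s))
    return total / len(student_layers)

def lm_loss(l_student, l_teacher, l_mse, scales=(1.0, 1.0, 1.0)):
    """Eq. (4): weighted sum; the default scales are placeholders."""
    l_s, l_t, l_m = scales
    return l_s * l_student + l_t * l_teacher + l_m * l_mse
```

A real training loop would compute these quantities on batched tensors; the nested-list form is only meant to make the indices in Eqs. (1) to (4) explicit.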

2.2.1 Weight Transferring

The weight transferring technique uses the pre-trained weights of a larger model to improve the initialization of student models [19, 22]. Random initializations select different modes within the function space, whereas initializations with subspace sampling tend to cluster within a single mode [24]. The subspace initializations are far apart in the weight space but exhibit similarities in the function space, resulting in a less diverse optimal space [24].

In this work, two weight initialization methods are compared: random initialization and weight transferring. Since the student model is characterized by a significantly reduced number of layers compared to the teacher model, we adopt an equidistant selection strategy [19, 20, 22] of teacher layers to transfer.
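One way to realise an equidistant selection is sketched below. The rounding rule and the choice to always include the first and last teacher layers are assumptions made for illustration; the paper does not state which exact indices were transferred.

```python
def equidistant_layers(n_teacher: int, n_student: int) -> list[int]:
    """Pick n_student teacher layer indices spread evenly over [0, n_teacher - 1].

    Assumption: the endpoints are included and fractional positions are
    rounded to the nearest layer index.
    """
    if not 1 <= n_student <= n_teacher:
        raise ValueError("need 1 <= n_student <= n_teacher")
    if n_student == 1:
        return [n_teacher - 1]
    step = (n_teacher - 1) / (n_student - 1)
    return [round(i * step) for i in range(n_student)]


# Example: choosing 4 student layers out of a 24-layer teacher.
selected = equidistant_layers(24, 4)
```

The same index list could also serve as a layer matching function for the intermediate-layer loss, by mapping student layer k to `selected[k]`.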

2.2.2 Dynamic Loss Weight Scheduling

Adapting the function space of the teacher model to the compressed student model requires continuously balancing the weights between the prior knowledge representation and the knowledge representation learned up to a given time. For this reason, weighting the different terms of the loss function contributes crucially to the model optimization process [25, 26]. While random weighting methods are theoretically stochastic versions of equal weighting approaches, random weighting exhibits a higher likelihood of escaping local minima [27, 28]. Consequently, it leads to improved generalization performance compared to equal weighting methods [27]. Although the weights in the loss function are crucial in the initial stage of training, they lose importance in the later stages [29]. Therefore, the goal in this work is to introduce randomness into the gradient noise while training within the same mini-batch, which in turn implicitly forces the model to learn a generalised representation. We explore modifying the contribution of each loss term in (4) dynamically at each training step, as follows:

$$L_{LM} = a_1 \cdot l_s \cdot L_S + a_2 \cdot l_t \cdot L_T + a_3 \cdot l_m \cdot L_{MSE} \tag{5}$$

where $a_i$, $i \in \{1,2,3\}$, s.t. $\sum_i a_i = 1$, are drawn from a distribution created with Algorithm 1 or with Algorithm 2, the latter introduced in [30].

Algorithm 1 Sampling Strategy 1 - S1
1: for $i \in \{1,2,3\}$ do
2:     $a_i \sim uniform(0,1)$
3: $S \leftarrow \sum_{j=1}^{3} a_j$
4: for $i \in \{1,2,3\}$ do
5:     $a_i \leftarrow \frac{a_i}{S}$

Algorithm 2 Sampling Strategy 2 - S2 [30]
1: $n \leftarrow 3$
2: $M \leftarrow 2^{32}$
3: $a \leftarrow$ Generate $n-1$ distinct values $a_i$ from $[1, M-1]$
4: $a \leftarrow$ sort($a$)
5: $a \leftarrow$ insert($a, 0, 0$)
6: $a \leftarrow$ append($a, M$)
7: $a \leftarrow$ diff($a$)
8: for $i \in \{1,\ldots,n\}$ do
9:     $a_i \leftarrow \frac{a_i}{M}$

Training with randomly sampled loss weights $a_i$ introduces gradient noise within the mini-batch, which encourages the model to learn a more general representation.
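Both sampling strategies and the dynamically weighted loss of Eq. (5) can be sketched in a few lines of Python using only the standard library. The function names, the `rng` argument and the way the scaling factors are passed are illustrative assumptions; this is not the authors' implementation.

```python
import random

def sample_weights_s1(n=3, rng=random):
    """Algorithm 1 (S1): draw n uniform values and normalise them to sum to 1."""
    draws = [rng.random() for _ in range(n)]
    total = sum(draws)  # zero only if every draw is exactly 0.0, which is negligible
    return [d / total for d in draws]

def sample_weights_s2(n=3, m=2**32, rng=random):
    """Algorithm 2 (S2): cut [0, m] at n - 1 distinct points and use the gaps."""
    cuts = sorted(rng.sample(range(1, m), n - 1))   # steps 3 and 4
    points = [0] + cuts + [m]                       # steps 5 and 6
    return [(hi - lo) / m for lo, hi in zip(points, points[1:])]  # steps 7 to 9

def dynamic_lm_loss(l_student, l_teacher, l_mse, scales, weights):
    """Eq. (5): per-step weights a_i multiplied by the fixed scaling factors."""
    l_s, l_t, l_m = scales
    a1, a2, a3 = weights
    return a1 * l_s * l_student + a2 * l_t * l_teacher + a3 * l_m * l_mse
```

In a training loop, one of the two samplers would be called once per step and its output passed as `weights`, so that the relative contribution of each loss term changes from step to step while the weights still sum to one.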

2.3 Decoder

We use the decoder of the EnCodec model [31], a neural audio codec that compresses audio into a set of tokens, resulting in a lower bitrate. EnCodec features a streaming, convolution-based encoder-decoder architecture, with sequential modeling applied to the latent representation. The encoder includes four 1D convolution blocks with 64 channels, comprising residual units and down-sampling layers, followed by a two-layer LSTM and a final 1D convolution. Each downsampling step doubles the number of channels to enhance the model capabilities. The decoder, mirroring the encoder design in reverse, transforms the latent representation into the output audio.

The proposed EnCodec distillation approach combines losses that use the ground truth labels with losses that use the teacher output. Concretely, let $x$ be the ground truth audio, $\hat{x}_S$ the output generated by the student model, and $\hat{x}_T$ the audio generated by the teacher model. The reconstruction distillation loss in the time domain is given by:

$$l_t(\hat{x}_S,\hat{x}_T) = \|\hat{x}_S - \hat{x}_T\|_1 \tag{6}$$

The reconstruction distillation loss in the frequency domain is a linear combination of the L1 and L2 losses over the Mel-spectrogram:

$$l_f(\hat{x}_S,\hat{x}_T) = \frac{1}{|\alpha|\cdot|s|}\sum_{\alpha_i\in\alpha}\sum_{i\in e} \|S_i(\hat{x}_S) - S_i(\hat{x}_T)\|_1 + \alpha_i \|S_i(\hat{x}_S) - S_i(\hat{x}_T)\|_2 \tag{7}$$

The adversarial distillation loss for the generator is:

$$l_g(\hat{x}_S,\hat{x}_T) = \frac{1}{K}\sum_{k}\max(0, 1 - D_k(\hat{x}_T)) \tag{8}$$

where $K$ is the number of discriminators used. The adversarial distillation feature matching loss for the generator is formulated as:

$$l_{feat}(\hat{x}_S,\hat{x}_T) = \frac{1}{KL}\sum_{k=1}^{K}\sum_{l=1}^{L}\frac{\|D_k^{l}(\hat{x}_S) - D_k^{l}(\hat{x}_T)\|_1}{\text{mean}\left(\|D_k^{l}(\hat{x}_S)\|_1\right)} \tag{9}$$

The final loss $L_G$ used to distill the teacher EnCodec is:

$$\begin{aligned}
L_G ={}& \lambda_t\cdot(l_t(x,\hat{x}_S) + l_t(\hat{x}_S,\hat{x}_T)) + \lambda_f\cdot(l_f(x,\hat{x}_S) + l_f(\hat{x}_S,\hat{x}_T)) \\
&+ \lambda_g\cdot(l_g(x,\hat{x}_S) + l_g(\hat{x}_S,\hat{x}_T)) \\
&+ \lambda_{feat}\cdot(l_{feat}(x,\hat{x}_S) + l_{feat}(\hat{x}_S,\hat{x}_T)) + \lambda_w\cdot(l_w(w))
\end{aligned} \tag{10}$$

where $\lambda_t, \lambda_f, \lambda_g, \lambda_{feat}, \lambda_w$ are scalars that balance the terms, set to 0.1, 2, 4, 4 and 0.1, respectively. The $l_w(w)$ loss measures the Euclidean distance between the current residual and its nearest codebook entry, summed over all residual steps in the batch. It should be emphasized that the discriminator was trained using a combination of teacher and student outputs; the discriminator loss was therefore updated as:

$$\begin{aligned}
L_d(x,\hat{x}_S,\hat{x}_T) = \frac{1}{K}\sum_{k=1}^{K}\; & 2\cdot(\max(0, 1 - D_k(x)) \\
&+ \max(0, 1 + D_k(\hat{x}_S))) + \max(0, 1 + D_k(\hat{x}_T))
\end{aligned} \tag{11}$$

Distillation in EnCodec is applied only to the decoder, so that the encoder and quantizer remain the same as in the teacher, preventing potential token mismatch with the MusicGen token generation model. The number of filters in the convolution kernels is decreased to reduce the number of parameters. Experiments with a single LSTM layer and unaltered filter numbers were also conducted, but they led to a relatively large decoder ($\approx$20M parameters).
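The hinge-style terms of this section can be sketched as follows. This is a simplified illustration under strong assumptions: each discriminator is modelled as a plain Python callable returning a single scalar score, and waveforms are lists of floats, whereas real discriminators operate on tensors and return feature maps. All function names are hypothetical.

```python
def l1_time(x_a, x_b):
    """Eq. (6): L1 distance between two waveforms of equal length."""
    return sum(abs(a - b) for a, b in zip(x_a, x_b))

def generator_hinge(discriminators, x_scored):
    """Hinge form of Eq. (8), averaged over the K discriminators.

    x_scored is whichever signal the discriminators evaluate.
    """
    k = len(discriminators)
    return sum(max(0.0, 1.0 - d(x_scored)) for d in discriminators) / k

def discriminator_loss(discriminators, x, x_student, x_teacher):
    """Eq. (11): real and student terms weighted by 2, teacher term by 1."""
    total = 0.0
    for d in discriminators:
        total += 2.0 * (max(0.0, 1.0 - d(x)) + max(0.0, 1.0 + d(x_student)))
        total += max(0.0, 1.0 + d(x_teacher))
    return total / len(discriminators)

# Scalars from the text, in the order lambda_t, lambda_f, lambda_g, lambda_feat, lambda_w.
LAMBDAS = {"t": 0.1, "f": 2.0, "g": 4.0, "feat": 4.0, "w": 0.1}
```

With these helpers, the time-domain part of Eq. (10) would read `LAMBDAS["t"] * (l1_time(x, x_student) + l1_time(x_student, x_teacher))`, and the remaining terms follow the same pattern of pairing a ground-truth loss with a teacher-output loss.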

3 Evaluation setup

The experiments in this work are performed on the MusicBench database [12]. MusicBench is derived from MusicCaps [5] by augmenting the audio with musically meaningful techniques (semitone pitch shifts, tempo changes and volume changes) and enhancing the captions. The total number of tracks for training is 52,768 leaving 400 samples for test.

We use two objective evaluation metrics to measure the quality of the generated music, computed with the AudioLDM eval toolkit (https://github.com/haoheliu/audioldm_eval): Fréchet Audio Distance (FAD) [32] and Kullback-Leibler (KL) divergence. The FAD indicates the similarity of the generated audio set to the target audio set in terms of the distribution of VGGish [33] features. The KL divergence is instead measured between the features of each generated sample and its target sample, extracted with the PANNs model [34], and then averaged over the entire set.
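For reference, FAD is commonly computed as the Fréchet distance between two multivariate Gaussians fitted to the embeddings of the target set and of the generated set. The notation below is introduced here for illustration and does not appear in the original text: $\mu_r, \Sigma_r$ are the mean and covariance of the target (reference) embeddings, and $\mu_g, \Sigma_g$ those of the generated embeddings.

```latex
\mathrm{FAD} = \|\mu_r - \mu_g\|_2^{2}
  + \operatorname{tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
```

Lower values indicate that the two embedding distributions are closer, which is why a lower FAD is better.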

4 Experimental Analyses

In this section we analyse the performance of each of the three TinyTTM modules individually, as well as the performance of the TinyTTM on the MusicBench dataset.

4.1 Encoder analysis

In Table 1 we present the music generation performance of different encoder configurations. The evaluation is carried out by replacing the text encoder and fine-tuning the LM of the MusicGen pipeline. The LM configuration used for these experiments is Variant 1 from Section 4.2, trained without ground truth labels, together with the default decoder. The first two models in Table 1 are the default ones without any modification. The last four are fine-tuned on the textual descriptions of the MusicBench dataset using the span-based MLM objective and cross-entropy loss [9]. In the last two experiments we remove the last one and the last two encoder layers (out of four layers), respectively, with the aim of decreasing the size.

The scores in Table 1 are averages over five generations obtained with different seeds. The table highlights the considerable performance gap between the default T5-base and T5-tiny in TTM. Fine-tuning the T5-tiny model leads to better results, reducing both the FAD and KL scores. However, the default T5-base configuration still achieves lower scores, at the cost of a model that is approximately ten times larger. Fine-tuning the T5-tiny with a frozen decoder does not improve the FAD over the default T5-tiny (4.17 vs. 4.14), although the KL decreases; this is potentially due to the limited number of trainable parameters. Finally, the T5-tiny models fine-tuned with two or three encoder layers show a slight deterioration compared to the fine-tuned T5-tiny model with four layers, suggesting a trade-off between model parameters and effectiveness.

Table 1: Performance on the different T5 configurations. FT: fine-tuned on MusicBench dataset (train set).
Method FAD KL #params
T5-base default 3.83 1.34 109.6M
T5-tiny default 4.14 1.64 11.3M
T5-tiny FT 3.92 1.41 11.3M
T5-tiny FT frozen decoder 4.17 1.50 11.3M
T5-tiny FT 3 encoder layers 4.02 1.48 10.6M
T5-tiny FT 2 encoder layers 4.00 1.46 9.8M
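
The size/quality trade-off in Table 1 can be made explicit with a few lines of arithmetic. The Python sketch below copies the FAD, KL and parameter counts from Table 1 and reports each configuration relative to the default T5-base. The variable names and the choice of T5-base as the reference are ours, for illustration only.

```python
# (FAD, KL, parameters in millions), copied from Table 1.
table1 = {
    "T5-base default": (3.83, 1.34, 109.6),
    "T5-tiny default": (4.14, 1.64, 11.3),
    "T5-tiny FT": (3.92, 1.41, 11.3),
    "T5-tiny FT frozen decoder": (4.17, 1.50, 11.3),
    "T5-tiny FT 3 encoder layers": (4.02, 1.48, 10.6),
    "T5-tiny FT 2 encoder layers": (4.00, 1.46, 9.8),
}

ref_fad, ref_kl, ref_params = table1["T5-base default"]

relative = {}
for name, (fad, kl, params) in table1.items():
    relative[name] = {
        "size_reduction": ref_params / params,  # how many times smaller than T5-base
        "fad_delta": fad - ref_fad,             # positive means worse than T5-base
        "kl_delta": kl - ref_kl,
    }

for name, r in relative.items():
    print(f"{name:30s} x{r['size_reduction']:.1f} smaller, "
          f"FAD {r['fad_delta']:+.2f}, KL {r['kl_delta']:+.2f}")
```

Read this way, the fine-tuned four-layer T5-tiny is about 9.7 times smaller than T5-base (11.3M vs. 109.6M parameters) at a cost of 0.09 FAD and 0.07 KL.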

4.2 Transformer Language Model analysis

Two variants of compressed transformers have been employed in the following experiments. Variant 1 (V1) includes a reduced number of transformer layers, but preserves the dimensionality of the teacher, to perform weight transfer. Variant 2 (V2) comprises more transformer blocks cascade, but employs smaller dimensionality, in order to maintain the number of parameters with Variant 1. A more detailed presentation of each variant configuration is shown in Table 2.

Table 2: LM Variants details. Teacher is MusicGen-Small.
Feature Variant 1 Variant 2 Teacher
Layers 4 7 24
Heads 16 8 16
Transformer dim. 1024 720 1024
Parameters 84.71M 70.5M 419.6M
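
As a rough sanity check on Table 2, the parameter counts can be approximated from the number of layers and the transformer dimension alone. The Python sketch below is a back-of-the-envelope estimate, not the authors' accounting. It assumes that each layer has self-attention, cross-attention and a feed-forward block with a hidden size of four times the model dimension, and that the LM uses 4 codebooks of 2048 entries for both the input embeddings and the output heads. These values are assumptions about the MusicGen configuration and are not stated in this section. Biases, normalization layers and the conditioning projection are ignored.

```python
def estimate_lm_params(layers: int, dim: int, codebooks: int = 4,
                       codebook_size: int = 2048, ffn_mult: int = 4) -> int:
    """Approximate parameter count of a decoder-only LM with cross-attention."""
    attention = 4 * dim * dim              # Q, K, V and output projections
    feed_forward = 2 * ffn_mult * dim * dim
    per_layer = 2 * attention + feed_forward  # self-attention + cross-attention + FFN
    embeddings = codebooks * codebook_size * dim
    output_heads = codebooks * dim * codebook_size
    return layers * per_layer + embeddings + output_heads


# (layers, transformer dim., reported parameters in millions) from Table 2.
table2 = {
    "Variant 1": (4, 1024, 84.71),
    "Variant 2": (7, 720, 70.5),
    "Teacher": (24, 1024, 419.6),
}

estimates = {}
for name, (layers, dim, reported_m) in table2.items():
    estimates[name] = estimate_lm_params(layers, dim) / 1e6
    gap = 100 * (reported_m - estimates[name]) / reported_m
    print(f"{name}: estimated {estimates[name]:.2f}M, "
          f"reported {reported_m}M ({gap:.2f}% gap)")
```

Under these assumptions the estimate lands within about 1% of each reported count, which is consistent with Variant 2 trading dimensionality (720 instead of 1024) for depth (7 layers instead of 4) at a comparable size.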

4.2.1 Teacher LM Model Fine-Tuning

The MusicGen-Small teacher model has been trained exclusively on audio samples that contain music without any vocals (see https://github.com/facebookresearch/audiocraft/blob/main/model_cards/MUSICGEN_MODEL_CARD.md). Since MusicBench contains vocals, we also fine-tune the pretrained model on the MusicBench training set before distillation to enable the model to generate vocals. In Table 3 we analyse the performance on the entire MusicBench test set (400 samples) and on two subsets of it: V (192 samples), whose text descriptions refer to vocals, and NV (208 samples), whose descriptions make no reference to vocals. This classification has been performed with the help of the Mixtral 8x7B (https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) LLM [35].

Table 3: Comparative results in terms of FAD and KL on the MusicBench test set A for the two candidate teacher models.
Entire set V subset NV subset
Model FAD KL FAD KL FAD KL
Teacher [1] 3.85 1.33 5.84 1.46 3.93 1.29
Teacher FT 2.73 1.20 3.19 1.24 3.37 1.23

Table 3 shows that the FAD and KL of the fine-tuned model (Teacher FT) improve on the V subset, as well as on the overall test set. The differences are smaller on the NV subset, where the fine-tuned model remains at least comparable to the pretrained one in terms of FAD and KL.

4.2.2 Student LM Knowledge Distillation

In Table 4 we present an ablation study on different distillation configurations and show their impact on FAD and KL. The teacher model fine-tuned on MusicBench demonstrates a notable improvement over the pre-trained MusicGen-Small model. The compressed models V1 and V2 benefit from distillation: every configuration that adds the teacher loss or the MSE loss obtains a lower FAD and KL than the corresponding configuration trained with the student loss alone. Overall, the V2 architecture reaches the lowest FAD and KL. Combining the teacher loss and the MSE loss gives the most favorable outcomes in terms of FAD and KL, and applying the loss sampling strategies improves the performance further.
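
To make the H, S and mse labels of the ablation concrete, the following Python sketch combines a student loss, a teacher loss and an intermediate MSE loss for a single token position. It is a minimal illustration under assumptions, not the training code used in this work: the use of cross-entropy for H, of KL(teacher || student) for S, the temperature, the equal default weights and the toy values are all hypothetical, and the hidden states are assumed to have already been projected to a common dimensionality.

```python
import math
from typing import Sequence, Tuple


def softmax(logits: Sequence[float], temperature: float = 1.0) -> list:
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def student_loss(student_logits: Sequence[float], target_index: int) -> float:
    """H: cross-entropy of the student against the ground-truth token."""
    return -math.log(softmax(student_logits)[target_index])


def teacher_loss(student_logits: Sequence[float], teacher_logits: Sequence[float],
                 temperature: float = 1.0) -> float:
    """S: KL(teacher || student) between softened output distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def mse_loss(student_hidden: Sequence[float], teacher_hidden: Sequence[float]) -> float:
    """mse: mean squared error between intermediate representations."""
    n = len(student_hidden)
    return sum((s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)) / n


def kd_loss(student_logits, teacher_logits, target_index,
            student_hidden, teacher_hidden,
            weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three terms; a zero weight disables a term."""
    w_h, w_s, w_mse = weights
    return (w_h * student_loss(student_logits, target_index)
            + w_s * teacher_loss(student_logits, teacher_logits)
            + w_mse * mse_loss(student_hidden, teacher_hidden))


# Toy values for one token position (hypothetical).
s_logits, t_logits = [2.0, 0.5, -1.0], [2.5, 0.0, -1.5]
s_hidden, t_hidden = [0.1, -0.3, 0.7], [0.2, -0.1, 0.6]
full = kd_loss(s_logits, t_logits, 0, s_hidden, t_hidden)               # H/S/mse
h_mse = kd_loss(s_logits, t_logits, 0, s_hidden, t_hidden, (1, 0, 1))   # H/mse
print(f"H/S/mse = {full:.4f}, H/mse = {h_mse:.4f}")
```

In this sketch, the rows of Table 4 correspond to different choices of `weights`, for example `(1, 0, 0)` for H only and `(0, 1, 1)` for S/mse.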

4.3 Decoder analysis

Table 5 reports the FAD and KL achieved with EnCodec. The target in this evaluation is the input audio to the EnCodec encoder, which is compared to the output generated by the EnCodec decoder. The weight factor is applied to $\lambda_{t}, \lambda_{f}, \lambda_{g}, \lambda_{feat}, \lambda_{w}$, which are the scalar coefficients used for balancing the gradients. Based on the results, the distilled EnCodec achieves a better FAD than the fine-tuned model (Tiny EnCodec FT), with a slight degradation (0.01) in the KL metric. Adding a weight factor to the scalar coefficients when distilling the model improves the FAD further, while the KL remains essentially unchanged. The weight factor of 0.75 provides the best results among the experiments. Consequently, the distilled EnCodec with a weight factor of 0.75 was selected as the decoder of TinyTTM.
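
One possible reading of the weight factor is sketched below in Python. The sketch assumes that the factor multiplies each of the five balancing coefficients and that the scaled terms are added to a separate, unscaled distillation term. This section does not spell out how the factor enters the objective, so the structure of the sum, the function names, the coefficient values and the loss values are all hypothetical placeholders.

```python
from typing import Dict


def scale_coefficients(lambdas: Dict[str, float], weight_factor: float) -> Dict[str, float]:
    """Apply the same weight factor to every balancing coefficient."""
    return {name: weight_factor * value for name, value in lambdas.items()}


def decoder_objective(losses: Dict[str, float], lambdas: Dict[str, float],
                      weight_factor: float, distillation_loss: float) -> float:
    """Distillation term plus the weighted sum of the codec loss terms (assumed form)."""
    scaled = scale_coefficients(lambdas, weight_factor)
    return distillation_loss + sum(scaled[name] * losses[name] for name in scaled)


# Placeholder coefficients and loss values (hypothetical, not from this work).
lambdas = {"t": 1.0, "f": 1.0, "g": 1.0, "feat": 1.0, "w": 1.0}
losses = {"t": 1.0, "f": 1.0, "g": 1.0, "feat": 1.0, "w": 1.0}

for wf in (0.75, 1.0, 1.25):
    total = decoder_objective(losses, lambdas, wf, distillation_loss=1.0)
    print(f"w.f. = {wf}: total objective = {total}")
```

Under this reading, a weight factor below 1 gives relatively more weight to the distillation term, and a factor above 1 gives it less.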

Table 4: Performance (mean ± standard deviation) of distilled LM with various setups. H: Student loss, S: Teacher loss, mse: Intermediate MSE loss. $S_i$: Sampling strategy $i$ from Section 2.2.2.
Variant Init. Strat. FAD KL
Teacher [1] Random 3.85 ± 0.10 1.38 ± 0.01
Teacher FT Weight Copy 2.73 ± 0.06 1.28 ± 0.01
V1 H Random 4.08 ± 0.11 1.61 ± 0.04
V1 S Random 3.62 ± 0.05 1.52 ± 0.02
V1 H Weight copy 4.17 ± 0.07 1.64 ± 0.01
V1 S Weight copy 4.01 ± 0.06 1.51 ± 0.02
V1 H/mse Random 3.80 ± 0.11 1.54 ± 0.02
V1 S/mse Random 3.85 ± 0.15 1.54 ± 0.03
V1 H/S/mse Random 3.61 ± 0.01 1.55 ± 0.03
V2 H Random 4.20 ± 0.05 1.62 ± 0.01
V2 S Random 3.76 ± 0.09 1.55 ± 0.01
V2 H/mse Random 3.83 ± 0.01 1.57 ± 0.01
V2 S/mse Random 3.47 ± 0.12 1.48 ± 0.04
V2 H/S/mse Random 3.45 ± 0.03 1.51 ± 0.03
V2 H/S/mse/$S_1$ Random 3.40 ± 0.02 1.49 ± 0.01
V2 H/S/mse/$S_2$ Random 3.53 ± 0.03 1.43 ± 0.02

Table 5: Performance on EnCodec model. w.f.: weight factor
Method FAD KL #params
Teacher EnCodec 3.84 1.15 28.4M
Tiny EnCodec FT 3.21 0.88 7.43M
Distilled EnCodec 3.17 0.89 7.43M
Distilled EnCodec (w.f. = 0.75) 2.87 0.88 7.43M
Distilled EnCodec (w.f. = 1.25) 3.15 0.89 7.43M

4.4 TinyTTM performance

We combine the best performing components from the previous sections to create TinyTTM. Table 6 compares the two candidate teacher models and the best performing TinyTTM models in terms of FAD, KL and total parameter count. The best compressed model (V2 H/S/mse/$S_2$) obtains a lower FAD and KL than the pretrained teacher, but does not reach the performance of the fine-tuned teacher. The size compression is up to $\times 6.25$, with a $\times 2.8$ latency reduction (measured on a single Nvidia A10 GPU). Although TinyTTM achieves a low FAD and KL, some of the generated tracks have noticeable issues, such as poor beat synchronization and singing voice that is either missing or does not sound like a voice. This may be due to the low audio diversity in MusicBench, as this dataset is created from only 5k tracks.

Table 6: Performance comparison (for 3 runs with different seeds) between the proposed TinyTTM model and MusicGen-Small on the MusicBench test set A.
Model/Method FAD KL #params
Teacher [1] 3.85 ± 0.11 1.33 ± 0.01 557.6M
Teacher FT 2.73 ± 0.05 1.20 ± 0.01 557.6M
V2 H/S/mse/$S_1$ 3.85 ± 0.06 1.39 ± 0.02 89.2M
V2 H/S/mse/$S_2$ 3.66 ± 0.14 1.32 ± 0.01 89.2M
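
The total parameter counts and the compression ratio can be reproduced from the per-component sizes reported in Tables 1, 2 and 5. The short Python sketch below does this arithmetic. The assignment of components to TinyTTM (the fine-tuned four-layer T5-tiny, the Variant 2 LM and the distilled EnCodec decoder) is our inference from the fact that their sizes add up to the reported total; it is not stated explicitly in this section.

```python
# Component sizes in millions of parameters, copied from Tables 1, 2 and 5.
teacher_components = {
    "text encoder (T5-base)": 109.6,
    "language model (MusicGen-Small)": 419.6,
    "decoder (EnCodec)": 28.4,
}
tiny_components = {
    "text encoder (T5-tiny FT)": 11.3,          # assumed choice, see above
    "language model (Variant 2)": 70.5,
    "decoder (distilled EnCodec)": 7.43,
}

teacher_total = sum(teacher_components.values())
tiny_total = sum(tiny_components.values())
compression = teacher_total / tiny_total

print(f"Teacher: {teacher_total:.1f}M, TinyTTM: {tiny_total:.1f}M, "
      f"compression: x{compression:.2f}")
```

The totals match the 557.6M and 89.2M parameters of Table 6, and their ratio gives the stated compression of about 6.25 times.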

5 Conclusions

We have presented TinyTTM, a TTM model that operates within a parameter-constrained framework. We highlighted the limitations of employing only ground truth labels for training, as outlined in [1], and discussed the enhancements achieved through KD techniques applied to each of the TTM components. With 89.2M parameters we achieve a FAD of 3.66 and a KL of 1.32, which are lower than those of the model in [1], but not as low as those of its fine-tuned counterpart. Future directions include extending the exploration of KD techniques to broader and more diverse datasets, in order to solve the issues observed when training with MusicBench (such as beat synchronization or voice distortion) and thus reduce the performance gap between TinyTTM and the fine-tuned MusicGen-Small.

References

  • [1] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” in NeurIPS 2023, 2023.
  • [2] K. Chen, Y. Wu, H. Liu, M. Nezhurina, T. Berg-Kirkpatrick, and S. Dubnov, “MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies,” in ICASSP, 2024.
  • [3] Q. Huang, A. Jansen, J. Lee, R. Ganti, J. Y. Li, and D. P. W. Ellis, “MuLan: A joint embedding of music audio and natural language,” in ISMIR, 2022, pp. 559–566.
  • [4] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP, 2023, pp. 1–5.
  • [5] A. Agostinelli, T. I. Denk, Z. Borsos, J. H. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. H. Frank, “MusicLM: Generating music from text,” 2023.
  • [6] Z. Evans, C. Carr, J. Taylor, S. H. Hawley, and J. Pons, “Fast timing-conditioned latent audio diffusion,” arXiv, vol. abs/2402.04825, 2024.
  • [7] H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. P. Mandic, W. Wang, and M. D. Plumbley, “AudioLDM: Text-to-audio generation with latent diffusion models,” in ICML, vol. 202, 2023, pp. 21450–21474.
  • [8] H. Liu, Q. Tian, Y. Yuan, X. Liu, X. Mei, Q. Kong, Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley, “AudioLDM 2: Learning holistic audio generation with self-supervised pretraining,” arXiv, vol. abs/2308.05734, 2023.
  • [9] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020.
  • [10] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, “Scaling instruction-finetuned language models,” arXiv, vol. abs/2210.11416, 2022.
  • [11] A. Ziv, I. Gat, G. L. Lan, T. Remez, F. Kreuk, A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “Masked audio generation using a single non-autoregressive transformer,” arXiv, vol. abs/2401.04577, 2024.
  • [12] J. Melechovsky, Z. Guo, D. Ghosal, N. Majumder, D. Herremans, and S. Poria, “Mustango: Toward controllable text-to-music generation,” arXiv, vol. abs/2311.08355, 2023.
  • [13] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, “Detection and Classification of Acoustic Scenes and Events,” IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733–1746, 2015.
  • [14] Y. Liu, H. Sun, G. Chen, Q. Wang, Z. Zhao, X. Lu, and L. Wang, “Multi-level knowledge distillation for speech emotion recognition in noisy conditions,” arXiv, vol. abs/2312.13556, 2023.
  • [15] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “TinyBERT: Distilling BERT for natural language understanding,” in EMNLP, 2020, pp. 4163–4174.
  • [16] H. Shao, W. Wang, B. Liu, X. Gong, H. Wang, and Y. Qian, “Whisper-KDQ: A lightweight whisper via guided knowledge distillation and quantization for efficient ASR,” arXiv, vol. abs/2305.10788, 2023.
  • [17] S. Sun, Y. Cheng, Z. Gan, and J. Liu, “Patient knowledge distillation for bert model compression,” in EMNLP, 2019, pp. 4323–4332.
  • [18] W. Su, X. Chen, S. Feng, J. Liu, W. Liu, Y. Sun, H. Tian, H. Wu, and H. Wang, “ERNIE-Tiny: A progressive distillation framework for pretrained transformer compression,” arXiv, vol. abs/2106.02241, 2021.
  • [19] C. Lu, J. Zhang, Y. Chu, Z. Chen, J. Zhou, F. Wu, H. Chen, and H. Yang, “Knowledge distillation of transformer-based language models revisited,” arXiv, vol. abs/2206.14366, 2022.
  • [20] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” arXiv, vol. abs/1910.01108, 2019.
  • [21] H. Chang, S. Yang, and H. Lee, “DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT,” in ICASSP, 2022, pp. 7087–7091.
  • [22] X. Lv, P. Zhang, S. Li, G. Gan, and Y. Sun, “Lightformer: Light-weight transformer using SVD-based weight transfer and parameter sharing,” in ACL, 2023, pp. 10323–10335.
  • [23] Y. Tay, M. Dehghani, J. Rao, W. Fedus, S. Abnar, H. W. Chung, S. Narang, D. Yogatama, A. Vaswani, and D. Metzler, “Scale efficiently: Insights from pre-training and fine-tuning transformers,” arXiv, vol. abs/2109.10686, 2021.
  • [24] S. Fort, H. Hu, and B. Lakshminarayanan, “Deep ensembles: A loss landscape perspective,” arXiv, vol. abs/1912.02757, 2019.
  • [25] H. Shimodaira, “Improving predictive inference under covariate shift by weighting the log-likelihood function,” Journal of Statistical Planning and Inference, vol. 90, pp. 227–244, 2000.
  • [26] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann, “Self-paced curriculum learning,” in AAAI, 2015, pp. 2694–2700.
  • [27] J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng, “Meta-Weight-Net: Learning an explicit mapping for sample weighting,” in NeurIPS, 2019, pp. 1917–1928.
  • [28] T. Fang, N. Lu, G. Niu, and M. Sugiyama, “Rethinking importance weighting for deep learning under distribution shift,” in NeurIPS, 2020.
  • [29] J. Byrd and Z. C. Lipton, “What is the effect of importance weighting in deep learning?” in ICML, vol. 97, 2019, pp. 872–881.
  • [30] N. A. Smith and R. W. Tromble, “Sampling uniformly from the unit simplex,” Department of Computer Science / Center for Language and Speech Processing, Johns Hopkins University, 8 2004.
  • [31] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv, vol. abs/2210.13438, 2022.
  • [32] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” in Interspeech, 2019, pp. 2350–2354.
  • [33] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. W. Wilson, “CNN architectures for large-scale audio classification,” in ICASSP, 2017, pp. 131–135.
  • [34] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 28, pp. 2880–2894, 2020.
  • [35] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de Las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mixtral of experts,” arXiv, vol. abs/2401.04088, 2024.