Authors (affiliation index in parentheses): Vasileios Moschopoulos (2), Thanasis Kotsiopoulos (2), Pablo Peso Parada (1), Konstantinos Nikiforidis (2), Alexandros Stergiadis (3), Gerasimos Papakostas (2), Md Asif Jalal (1), Jisi Zhang (1), Anastasios Drosou (2), Karthikeyan Saravanan (1)

Exploring compressibility of transformer based text-to-music (TTM) models

Abstract

State-of-the-art Text-To-Music (TTM) generative AI models are large and require desktop or server class compute, making them infeasible for deployment on mobile phones. This paper presents an analysis of trade-offs between model compression and generation performance of TTM models. We study compression through knowledge distillation and specific modifications that make it applicable to the various components of the TTM model (the encoder, the generative model and the decoder). Leveraging these methods we create TinyTTM (89.2M params), which achieves a FAD of 3.66 and a KL of 1.32 on the MusicBench dataset, better than MusicGen-Small (557.6M params) but not as low as MusicGen-Small fine-tuned on MusicBench.

keywords:

Text-To-Music, Distillation, Compression.

1 Introduction

Recent developments in generative AI have facilitated the emergence of various novel applications, including text-to-music (TTM) systems to generate music content from textual prompts. These systems allow users to efficiently craft music compositions that fulfill their specific requirements, reducing the necessity for expertise in music composition. Recent works on this topic, such as MusicGen [1] or MusicLDM [2], have demonstrated the capability of generating high-quality music samples, highlighting the potential of TTM.

TTM models consist of three primary components: a text encoder, a generative model and a decoder. The text encoder translates the input text into an embedding representation. Different text encoders are employed in the literature: contrastive language-audio pretrained models such as MuLan [3] or CLAP [4], employed in MusicLM [5], Stable Audio [6], AudioLDM [7] and AudioLDM2 [8]; or language-model-based encoders [9, 10], integrated in MAGNeT [11], MusicGen [1], Mustango [12] and AudioLDM2 [8]. There are two primary approaches for the generative model: autoregressive or non-autoregressive transformers, as in [5, 1, 11], where the transformer produces a sequence of discrete tokens conditioned on the encoder embedding via the cross-attention mechanism; and diffusion-based models [7, 8, 2, 6, 12], which learn to denoise a randomly generated latent representation conditioned on the encoder embedding. The decoder accepts the generative model output and transforms it into a time-domain signal via audio codecs [5, 1, 11, 7, 8, 2, 12, 6].

Contemporary TTM models [1, 7] are large, with hundreds of millions of parameters. Even the smaller variants of [1] and [7] have 557.6M parameters (the authors mention 300M parameters, but this refers to the autoregressive transformer only) and 428M parameters, respectively. It is essential to minimise the size of such models in order to deploy them on resource-constrained devices. In this work we propose and analyse different methods to reduce the size of each component in TTM. A popular approach to reducing model size is knowledge distillation, where the “knowledge” of the original model, often referred to as the “teacher”, is distilled into a smaller model referred to as the “student”. The mismatch between the two is typically quantified with the Kullback-Leibler divergence loss [13, 14], which measures the discrepancy between the teacher and student output distributions. Furthermore, the intermediate layer outputs of the teacher model can also be exploited [15, 16, 17, 18], providing the student model with access to deeper knowledge and encouraging finer alignment between the models’ internal representations. Another common strategy is weight transfer [19], where part of the teacher weight matrices is copied directly to the student model [20, 21, 22] to accelerate learning and improve final performance [16].

Figure 1 summarizes this work: we start from a pretrained MusicGen-Small model and, through the methods presented in this paper, reduce the model size from 557.6M parameters down to 89.2M parameters. This reduction is achieved by individually optimizing each of the three components of the TTM model through a method that combines distillation and fine-tuning techniques using the MusicBench [12] dataset.

Figure 1: MusicGen-Small vs. proposed TinyTTM.

Contributions of this work are:

• The creation of a tiny TTM system containing 89.2M parameters which, for the first time to the best of our knowledge, demonstrates performance comparable to larger systems while being more suitable for deployment on resource-constrained devices.

• A knowledge distillation technique applying adaptive learning through stochastic loss sampling. In this method, the importance of the different parts of the loss (teacher loss, ground-truth loss and model intermediate loss) is re-drawn at each distillation step. This loss weight scheduling performs better than existing methods in which the weight given to each part of the loss is fixed: it improves representation learning by adding uncertainty to how the losses are weighted. Two different distributions are explored to model the loss weight scheduling.

2 TinyTTM

This section describes the MusicGen [1] components and the methods proposed to effectively reduce the model size.

2.1 Encoder

The encoder is used to create the embedding vector for conditioning the generator. This work uses T5 (Text-to-Text Transformer) [9] following the structure proposed by MusicGen[1].

The T5 model, based on a Transformer encoder-decoder architecture, is offered in various sizes [23]. This paper focuses predominantly on T5-tiny (utilized in TinyTTM) and compares its performance with T5-base (implemented in MusicGen-Small). We evaluate the T5-tiny pretrained model (footnote 1: https://huggingface.co/google/t5-efficient-tiny) and fine-tune it on text containing music descriptions from MusicBench. To fine-tune the model we leverage the span-based MLM objective and cross-entropy loss [9]. During our investigation, we specifically explore the strategy of freezing the decoder layers during training, as these layers are not utilized during inference. Additionally, to further reduce the four-layer T5-tiny encoder, we explore dropping the last layers.
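The span-based MLM objective can be illustrated with a small sketch. It follows the sentinel-token format described for T5 in [9]; the helper name `span_corrupt`, the example caption and the hand-picked span positions are hypothetical and only serve the illustration (in practice spans are sampled at random).

```python
def span_corrupt(tokens, spans):
    """Build a T5-style (input, target) pair by masking the given spans.

    tokens: list of str.
    spans:  sorted, non-overlapping (start, end) index pairs, end exclusive.
            They are chosen by hand here; real training samples them randomly.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # the span is replaced by a sentinel
        targets.append(sentinel)             # target: sentinel, then the span
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the trailing text
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


# Hypothetical music caption, used only for illustration.
caption = "a mellow jazz track with soft piano and brushed drums".split()
model_input, model_target = span_corrupt(caption, [(1, 3), (6, 7)])
print(model_input)
print(model_target)
```

The encoder sees the corrupted caption and the decoder is trained with cross-entropy to emit the masked spans, which is why the decoder is needed during fine-tuning even though only the encoder is kept at inference time.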

2.2 Generative Transformer Language Model (LM)

The Language Model (LM) in MusicGen [1] is trained using cross-entropy loss between the predicted output logits and the ground truth audio tokens generated by EnCodec. We define the student loss $L_{S}$ as:

L_{S}=-\frac{1}{K}\sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{c=1}^{C}y_{k,t,c}\log(p_{k,t,c})

(1)

where $T$ is the number of temporal steps, $K$ the number of codebooks, $C$ the cardinality of each codebook, and $y_{k,t,c}$ and $p_{k,t,c}$ represent the ground truth and predicted probability of the $c^{th}$ token at time $t$ for the $k^{th}$ codebook, respectively. The batch dimension is omitted for simplicity.

In this work, prior knowledge from a teacher model is leveraged to guide the student model. This is accomplished through the utilization of both the teacher LM outputs, as well as the hidden representations within the transformer architecture. We define the teacher loss $L_{T}$ as KL divergence based [19, 16]:

L_{T}=\frac{1}{K}\sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{c=1}^{C}q_{k,t,c}\log\left(\frac{q_{k,t,c}}{p_{k,t,c}}\right)

(2)

where $q_{k,t,c}$ and $p_{k,t,c}$ are the predicted probabilities of the $c^{th}$ token at time $t$ for the $k^{th}$ codebook according to the teacher and student model, respectively. The teacher loss $L_{T}$ is the KL divergence averaged across all $K$ codebooks. Additionally, the teacher intermediate transformer layer outputs $\hat{y}_{T}$ are used as guidance to align the corresponding student outputs $\hat{y}_{S}$ :

L_{MSE}(\hat{y}_{T},\hat{y}_{S})=\frac{1}{K_{tr}}\sum_{k=1}^{K_{tr}}|\hat{y}_{T,m(k)}-\hat{y}_{S,k}|^{2}

(3)

where $K_{tr}$ denotes the number of student transformer layers, and $|\cdot|^{2}$ signifies the squared Euclidean norm. The additional subscripts in $\hat{y}_{T,m(k)}$ and $\hat{y}_{S,k}$ represent the layer index, and $m(k)$ is a function that maps the student layer index $k$ to the index of the matched teacher layer, based on the layer matching strategy that is employed. The final loss is given by the weighted sum of the aforementioned loss terms:

L_{LM}=l_{s}\cdot L_{S}+l_{t}\cdot L_{T}+l_{m}\cdot L_{MSE}

(4)

where $l_{s}$ , $l_{t}$ and $l_{m}$ are scaling factors applied to each loss term to bring their magnitudes into the same range.
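The loss terms above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the training code: the tensor shapes, the function names, the `layer_map` argument (standing in for $m(k)$) and the scaling factors are hypothetical, and the teacher term is computed as the non-negative KL divergence from the teacher to the student distribution.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (token) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def student_loss(student_logits, targets):
    """Cross-entropy summed over time steps and averaged over K codebooks.

    student_logits: (K, T, C) array; targets: (K, T) integer token ids.
    """
    log_p = log_softmax(student_logits)
    K, T, _ = student_logits.shape
    picked = log_p[np.arange(K)[:, None], np.arange(T)[None, :], targets]
    return float(-picked.sum() / K)


def teacher_loss(student_logits, teacher_logits):
    """KL(teacher || student) summed over time steps, averaged over K codebooks."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    K = student_logits.shape[0]
    return float((np.exp(log_q) * (log_q - log_p)).sum() / K)


def hidden_mse_loss(teacher_hidden, student_hidden, layer_map):
    """Mean over student layers of the squared Euclidean distance to the
    matched teacher layer; layer_map[k] plays the role of m(k)."""
    terms = [np.sum((teacher_hidden[m] - s) ** 2)
             for s, m in zip(student_hidden, layer_map)]
    return float(np.mean(terms))


# Toy shapes and scaling factors, chosen only for illustration.
rng = np.random.default_rng(0)
K, T, C, D = 4, 5, 16, 8
student_logits = rng.normal(size=(K, T, C))
teacher_logits = rng.normal(size=(K, T, C))
targets = rng.integers(0, C, size=(K, T))
teacher_hidden = [rng.normal(size=(T, D)) for _ in range(6)]
student_hidden = [rng.normal(size=(T, D)) for _ in range(2)]
layer_map = [2, 5]  # hypothetical m(k): student layer k -> teacher layer

l_s, l_t, l_m = 1.0, 1.0, 0.1  # hypothetical scaling factors
total = (l_s * student_loss(student_logits, targets)
         + l_t * teacher_loss(student_logits, teacher_logits)
         + l_m * hidden_mse_loss(teacher_hidden, student_hidden, layer_map))
print(total)
```

The MSE term requires the matched hidden states to have the same dimensionality; when the student is narrower than the teacher, some projection would be needed, which this sketch does not cover.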

2.2.1 Weight Transferring

The weight transferring technique employs the pre-trained weights of a larger model to enhance the initialization of student models [19, 22]. Random initializations select different modes within the function space, whereas initializations obtained by subspace sampling tend to cluster within a single mode [24]. The subspace initializations are far from each other in weight space but exhibit similarities in function space, resulting in a less diverse set of optima [24].

In this work, two weight initialization methods are compared: random initialization and weight transferring. Since the student model is characterized by a significantly reduced number of layers compared to the teacher model, we adopt an equidistant selection strategy [19, 20, 22] of teacher layers to transfer.
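A minimal sketch of an equidistant selection is given below. The exact indices used in this work are not stated, so the rule implemented here (evenly spaced layers ending at the last teacher layer) is an assumption, and the function name and the toy layer dictionaries are hypothetical.

```python
import copy


def equidistant_layers(num_teacher_layers, num_student_layers):
    """Return 0-based teacher layer indices spaced evenly across the stack.

    Assumption: the last selected index is the last teacher layer.
    """
    stride = num_teacher_layers / num_student_layers
    return [round((k + 1) * stride) - 1 for k in range(num_student_layers)]


# 24 teacher layers and a 4-layer student, as in Variant 1 of Table 2.
selected = equidistant_layers(24, 4)
print(selected)

# Toy weight transfer: each "layer" is a dict standing in for its parameters.
teacher_layers = [{"layer_id": i, "weights": [float(i)] * 3} for i in range(24)]
student_layers = [copy.deepcopy(teacher_layers[i]) for i in selected]
```

Direct copying of this kind is only possible when student and teacher layers share the same dimensionality, which is the stated reason Variant 1 preserves the teacher's transformer dimension.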

2.2.2 Dynamic Loss Weight Scheduling

Adapting the function space of the teacher model to the compressed student model requires continuously balancing the prior knowledge representation against the knowledge representation learned up to a given time. For this reason, weighting the different terms of the loss function contributes crucially to the model optimization process [25, 26]. While random weighting methods are theoretically stochastic versions of equal weighting approaches, random weighting exhibits a higher likelihood of avoiding local minima [27, 28]. Consequently, it leads to improved generalization performance compared to equal weighting methods [27]. Although the weights in the loss function are crucial in the initial stage of training, they lose importance in the later stages [29]. Therefore, the goal in this work is to introduce randomness into the gradient noise while training within the same mini-batch, which in turn implicitly forces the model to learn a generalised representation. We explore modifying the contribution of each loss term in (4) dynamically at each training step, as follows:

L_{LM}=a_{1}\cdot l_{s}\cdot L_{S}+a_{2}\cdot l_{t}\cdot L_{T}+a_{3}\cdot l_{m}\cdot L_{MSE}

(5)

where $a_{i},i\in\{1,2,3\}$ s.t. $\sum_{i}a_{i}=1$ are drawn from a distribution created with Algorithm 1 or Algorithm 2, the latter introduced in [30].

Algorithm 1 Sampling Strategy 1 - S1
1: for $i\in\{1,2,3\}$ do
2:     $a_{i}\sim uniform(0,1)$
3: $S\leftarrow\sum_{j=1}^{3}{a_{j}}$
4: for $i\in\{1,2,3\}$ do
5:     $a_{i}\leftarrow\frac{a_{i}}{S}$
Algorithm 2 Sampling Strategy 2 - S2 [30]
1: $n\leftarrow 3$
2: $M\leftarrow 2^{32}$
3: $a\leftarrow$ Generate $n-1$ distinct values $a_{i}$ from $[1,M-1]$
4: $a\leftarrow$ sort( $a$ )
5: $a\leftarrow$ insert( $a,0,0$ )
6: $a\leftarrow$ append( $a,M$ )
7: $a\leftarrow$ diff( $a$ )
8: for $i\in\{1,\ldots,n\}$ do
9:     $a_{i}\leftarrow\frac{a_{i}}{M}$

Training with randomly sampled loss weights $a_{i}$ introduces gradient noise within the mini-batch, which enables the model to learn a more generalised representation.
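The two sampling strategies can be sketched with the Python standard library as below. The function names, the seed and the dummy loss values are hypothetical; only the sampling steps follow Algorithm 1 and Algorithm 2.

```python
import random


def sample_s1(rng, n=3):
    """Algorithm 1: draw n uniform(0, 1) values and normalise them to sum to 1."""
    a = [rng.random() for _ in range(n)]
    total = sum(a)
    return [x / total for x in a]


def sample_s2(rng, n=3, m=2**32):
    """Algorithm 2: sorted cut points in [1, M-1] split [0, M] into n gaps."""
    cuts = sorted(rng.sample(range(1, m), n - 1))  # n-1 distinct integers
    points = [0] + cuts + [m]                      # insert 0, append M
    return [(hi - lo) / m for lo, hi in zip(points, points[1:])]  # diff, scale


rng = random.Random(0)  # hypothetical seed, for reproducibility only

# Dummy values standing in for the scaled terms l_s*L_S, l_t*L_T and l_m*L_MSE.
scaled_losses = [2.0, 1.5, 0.5]

for sampler in (sample_s1, sample_s2):
    a = sampler(rng)  # new weights are drawn at every training step
    step_loss = sum(w * term for w, term in zip(a, scaled_losses))
    print(sampler.__name__, a, step_loss)
```

Both strategies return weights that sum to one. Normalising independent uniform draws (S1) tends to favour balanced weights, whereas the cut-point construction of [30] (S2) samples uniformly from the unit simplex, so strongly unbalanced weightings occur more often.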

2.3 Decoder

The decoder of the EnCodec model [31] is used. EnCodec is a neural audio codec that compresses audio into a set of tokens, resulting in a lower bitrate. It features a streaming convolutional encoder-decoder architecture, with sequential modeling applied to the latent representation. The encoder includes four 1D convolution blocks with 64 channels, including residual units and down-sampling layers, as well as a two-layer LSTM and a final 1D convolution. The decoder, mirroring the encoder design in reverse, transforms the latent representation into the output audio. Each downsampling step doubles the number of channels to enhance the model capabilities.

The proposed EnCodec distillation approach combines losses computed against the ground truth and against the teacher output. Concretely, let $x$ be the ground truth audio, $\hat{x}_{S}$ be the output generated by the student model, and $\hat{x}_{T}$ be the audio generated by the teacher model. The reconstruction distillation loss in the time domain is given by:

l_{t}(\hat{x}_{S},\hat{x}_{T})=||\hat{x}_{S}-\hat{x}_{T}||_{1}

(6)

The reconstruction distillation loss in the frequency domain is a linear combination of L1 and L2 losses over the Mel-spectrogram:

l_{f}(\hat{x}_{S},\hat{x}_{T})=\frac{1}{|\alpha|\cdot|s|}\sum_{\alpha_{i}\in\alpha}\sum_{i\in e}||S_{i}(\hat{x}_{S})-S_{i}(\hat{x}_{T})||_{1}+\alpha_{i}||S_{i}(\hat{x}_{S})-S_{i}(\hat{x}_{T})||_{2}

(7)

The adversarial distillation loss for the generator is:

l_{g}(\hat{x}_{S},\hat{x}_{T})=\frac{1}{K}\sum_{k}\max(0,1-D_{k}(\hat{x}_{T}))

(8)

where $K$ is the number of discriminators utilized. The adversarial distillation feature matching loss for the generator, computed over the $L$ layers of each discriminator, is formulated as:

l_{feat}(\hat{x}_{S},\hat{x}_{T})=\frac{1}{KL}\sum_{k=1}^{K}\sum_{l=1}^{L}\frac{||D_{k}^{l}(\hat{x}_{S})-D_{k}^{l}(\hat{x}_{T})||_{1}}{\text{mean}\left(||D_{k}^{l}(\hat{x}_{S})||_{1}\right)}

(9)

The final loss $L_{G}$ used to distill the teacher EnCodec is:

L_{G}=\lambda_{t}\cdot(l_{t}(x,\hat{x}_{S})+l_{t}(\hat{x}_{S},\hat{x}_{T}))+\lambda_{f}\cdot(l_{f}(x,\hat{x}_{S})+l_{f}(\hat{x}_{S},\hat{x}_{T}))+\lambda_{g}\cdot(l_{g}(x,\hat{x}_{S})+l_{g}(\hat{x}_{S},\hat{x}_{T}))+\lambda_{feat}\cdot(l_{feat}(x,\hat{x}_{S})+l_{feat}(\hat{x}_{S},\hat{x}_{T}))+\lambda_{w}\cdot(l_{w}(w))

(10)

where $\lambda_{t},\lambda_{f},\lambda_{g},\lambda_{feat},\lambda_{w}$ are scalars that balance the terms, set to 0.1, 2, 4, 4 and 0.1, respectively. The $l_{w}(w)$ loss measures the Euclidean distance between the current residual and its nearest codebook entry, summed over all residual steps in the batch. It should be emphasized that the discriminator was trained using a combination of teacher and student outputs, thus the discriminator loss was updated as:

L_{d}(x,\hat{x}_{S},\hat{x}_{T})=\frac{1}{K}\sum_{k=1}^{K}2\cdot(\max(0,1-D_{k}(x))+\max(0,1+D_{k}(\hat{x}_{S})))+\max(0,1+D_{k}(\hat{x}_{T}))

(11)
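The hinge structure of the discriminator loss can be sketched as follows. As a simplifying assumption, each discriminator is reduced to a single scalar score per input (real discriminators return feature maps that are averaged); the function names and the example scores are hypothetical.

```python
def hinge(value):
    """max(0, value), the hinge used in every term of the loss."""
    return max(0.0, value)


def discriminator_loss(d_real, d_student, d_teacher):
    """Discriminator hinge loss averaged over K discriminators.

    d_real[k], d_student[k], d_teacher[k] are the scalar scores that
    discriminator k assigns to the ground truth audio, the student output
    and the teacher output, respectively.
    """
    K = len(d_real)
    total = 0.0
    for real, student, teacher in zip(d_real, d_student, d_teacher):
        total += 2.0 * (hinge(1.0 - real) + hinge(1.0 + student))
        total += hinge(1.0 + teacher)
    return total / K


# Confident discriminators: real audio scored above +1, generated below -1.
print(discriminator_loss([2.0, 1.5], [-2.0, -1.2], [-1.1, -3.0]))
# Undecided discriminators: every score is zero.
print(discriminator_loss([0.0, 0.0], [0.0, 0.0], [0.0, 0.0]))
```

The loss vanishes once the ground truth is scored above the margin and both generated signals are scored below it, so the teacher output acts as an additional source of fake examples for the discriminator.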

Distillation in the EnCodec is applied only to the decoder, keeping the same encoder and quantizer as the teacher to prevent potential token mismatch with the MusicGen token generation model. The number of filters in the convolution kernels is decreased to reduce the number of parameters. Experiments using a single LSTM layer without altering the number of filters were also conducted, but they led to a relatively large decoder ( $\approx$ 20M parameters).

3 Evaluation setup

The experiments in this work are performed on the MusicBench database [12]. MusicBench is derived from MusicCaps [5] by augmenting the audio with musically meaningful techniques (semitone pitch shifts, tempo changes and volume changes) and enhancing the captions. The total number of tracks for training is 52,768 leaving 400 samples for test.

We use two objective evaluation metrics to measure the quality of the generated music, computed with the AudioLDM eval toolkit (footnote 2: https://github.com/haoheliu/audioldm_eval): Fréchet Audio Distance (FAD) [32] and Kullback–Leibler (KL) divergence. The FAD indicates the similarity of the generated audio set to the target audio set in terms of the distribution of VGGish [33] features. The KL is instead measured between features of the generated sample and the target sample, extracted with the PANNs model [34], and then averaged over the entire set.
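For intuition, both metrics can be sketched in NumPy on precomputed features. This is not the toolkit implementation: random vectors stand in for VGGish embeddings, the probability vectors stand in for PANNs class posteriors, and the function names and the direction of the KL divergence are assumptions.

```python
import numpy as np


def frechet_distance(feats_ref, feats_gen):
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_r, mu_g = feats_ref.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_ref, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative in theory;
    # small numerical deviations are clipped.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability vectors."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    return float(np.sum(p * np.log(p / q)))


rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 4))                      # stand-in embeddings
shifted = reference + np.array([1.0, 0.0, 0.0, 0.0])       # same spread, new mean

print(frechet_distance(reference, reference))  # close to 0: identical sets
print(frechet_distance(reference, shifted))    # close to 1: squared mean shift
print(kl_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))
```

FAD compares whole sets, so it needs no pairing between generated and reference tracks, while the KL term is computed per generated/target pair and then averaged.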

4 Experimental Analyses

In this section we analyse the performance of each of the three TinyTTM modules individually, as well as the performance of the TinyTTM on the MusicBench dataset.

4.1 Encoder analysis

In Table 1 we present the music generation performance of different encoder configurations. The evaluation is carried out by replacing the text encoder and fine-tuning the LM of the MusicGen pipeline. The LM configuration used for these experiments is the Variant 1 in Section 4.2, trained without ground truth labels, and the default decoder. The first two models, in Table 1, are the default ones without any modification. The last four are fine-tuned on the textual descriptions of MusicBench dataset exploiting span-based MLM objective and cross-entropy loss [9]. In the last two experiments we remove the last one and last two encoder layers (out of four layers) respectively with the aim of decreasing the size.

The scores in Table 1 are averaged over five generations produced with different seeds. The table highlights the considerable performance gap between the default T5-base and T5-tiny in TTM. Fine-tuning the T5-tiny model improves the results, reducing both the FAD and KL scores. However, the default T5-base configuration still achieves lower scores, at the cost of a model roughly ten times larger. Fine-tuning T5-tiny with a frozen decoder does not improve FAD over the default T5-tiny (4.17 vs. 4.14), potentially due to the limited number of trainable parameters. Finally, T5-tiny models fine-tuned with two or three encoder layers show a slight deterioration compared to the fine-tuned T5-tiny model with four layers, suggesting a trade-off between model size and effectiveness.

Table 1: Performance on the different T5 configurations. FT: fine-tuned on MusicBench dataset (train set).

Method	FAD	KL	#params
T5-base default	3.83	1.34	109.6M
T5-tiny default	4.14	1.64	11.3M
T5-tiny FT	3.92	1.41	11.3M
T5-tiny FT frozen decoder	4.17	1.50	11.3M
T5-tiny FT 3 encoder layers	4.02	1.48	10.6M
T5-tiny FT 2 encoder layers	4.00	1.46	9.8M

4.2 Transformer Language Model analysis

Two variants of compressed transformers are employed in the following experiments. Variant 1 (V1) has a reduced number of transformer layers but preserves the dimensionality of the teacher, so that weight transfer can be performed. Variant 2 (V2) cascades more transformer blocks but uses a smaller dimensionality, in order to keep the number of parameters comparable to Variant 1. The detailed configuration of each variant is shown in Table 2.

Table 2: LM Variants details. Teacher is MusicGen-Small.

Feature	Variant 1	Variant 2	Teacher
Layers	4	7	24
Heads	16	8	16
Transformer dim.	1024	720	1024
Parameters	84.71M	70.5M	419.6M
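As a rough plausibility check of the parameter counts in Table 2, the dominant weight matrices can be estimated from the layer count and the transformer dimension. The sketch below rests on assumptions that are not stated in this paper: each layer holds a self-attention block, a cross-attention block and a feed-forward block with hidden size four times the model dimension; there are four codebooks of 2048 entries; and biases, normalization layers and conditioning projections are ignored. The function name is hypothetical.

```python
def estimate_lm_params(layers, dim, codebooks=4, cardinality=2048):
    """Rough weight count of a MusicGen-style decoder (assumptions above)."""
    attention = 4 * dim * dim            # Q, K, V and output projections
    feed_forward = 2 * dim * (4 * dim)   # two linear maps, hidden size 4*dim
    per_layer = 2 * attention + feed_forward  # self-attn + cross-attn + FFN
    embeddings = codebooks * (cardinality + 1) * dim  # +1 special token (assumed)
    output_heads = codebooks * dim * cardinality
    return layers * per_layer + embeddings + output_heads


# (layers, transformer dim, reported parameters in millions) from Table 2.
table_2 = {
    "Variant 1": (4, 1024, 84.71),
    "Variant 2": (7, 720, 70.5),
    "Teacher": (24, 1024, 419.6),
}

for name, (layers, dim, reported_m) in table_2.items():
    estimate_m = estimate_lm_params(layers, dim) / 1e6
    print(f"{name}: estimated {estimate_m:.1f}M, reported {reported_m}M")
```

Under these assumptions the estimates fall within about one percent of the reported values, which illustrates why trading depth (24 to 7 layers) against width (1024 to 720) yields a parameter budget similar to the four-layer Variant 1.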

4.2.1 Teacher LM Model Fine-Tuning

The MusicGen-Small teacher model has been trained exclusively with audio samples that contain music without any vocals (footnote 3: https://github.com/facebookresearch/audiocraft/blob/main/model_cards/MUSICGEN_MODEL_CARD.md). Since MusicBench contains vocals, we also fine-tune the pretrained model on the MusicBench training set before distillation to enable the model to generate vocals. In Table 3 we analyse the performance on the entire MusicBench test set (400 samples) and on two subsets of it: V (192 samples), whose text descriptions contain references to vocals, and NV (208 samples), with no reference to vocals. This classification has been performed with the help of the Mixtral 8x7B LLM [35] (footnote 4: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1).

Table 3: Comparative results in terms of FAD and KL on the MusicBench Test A set for the two candidate teacher models.

	Entire set		V subset		NV subset
Model	FAD	KL	FAD	KL	FAD	KL
Teacher [1]	3.85	1.33	5.84	1.46	3.93	1.29
Teacher FT	2.73	1.20	3.19	1.24	3.37	1.23

Table 3 shows that the FAD and KL of the fine-tuned model (Teacher FT) improve on the V subset, as well as on the overall test set. The differences are smaller on the NV subset, where the fine-tuned model remains comparable to the pretrained one in terms of FAD and KL.

4.2.2 Student LM Knowledge Distillation

In Table 4 we present an ablation study on different distillation configurations and show their impact on FAD and KL. The teacher model fine-tuned on MusicBench demonstrates a notable improvement over the pre-trained MusicGen-Small model. For the compressed models V1 and V2, adding the teacher loss or the intermediate MSE loss improves performance over training with the student loss alone (H). Overall, the V2 architecture shows lower FAD and KL. Combining the student, teacher and MSE losses results in the most favorable outcomes in terms of FAD and KL, while the application of loss sampling strategies further improves the performance.

4.3 Decoder analysis

Table 5 reports the FAD and KL achieved with EnCodec. The target in the evaluation is the input audio to the EnCodec encoder, which is compared to the output generated by the EnCodec decoder. The weight factor is applied to $\lambda_{t},\lambda_{f},\lambda_{g},\lambda_{feat},\lambda_{w}$ , the scalar coefficients used for balancing the gradients. Based on the results, the distilled EnCodec achieves better performance than the fine-tuned model (Tiny EnCodec FT) on the FAD metric, with a slight drop (0.01) in the KL metric. Applying a weight factor to the scalar coefficients when distilling the model further improves FAD, most notably with a weight factor of 0.75, while KL remains essentially unchanged. The experiment with the weight factor equal to 0.75 provided the best results among all experiments. Consequently, the distilled EnCodec with weight factor equal to 0.75 was selected as the decoder of TinyTTM.

Table 4: Performance (mean $\pm$ standard deviation) of distilled LM with various setups. H: Student loss, S: Teacher loss, mse: Intermediate MSE loss. $\bm{S_{i}}$ : Sampling strategy $i$ from Section 2.2.2.

Variant	Init. Strat.	FAD	KL
Teacher [1]	Random	3.85 $\pm$ 0.10	1.38 $\pm$ 0.01
Teacher FT	Weight Copy	2.73 $\pm$ 0.06	1.28 $\pm$ 0.01
V1 H	Random	4.08 $\pm$ 0.11	1.61 $\pm$ 0.04
V1 S	Random	3.62 $\pm$ 0.05	1.52 $\pm$ 0.02
V1 H	Weight copy	4.17 $\pm$ 0.07	1.64 $\pm$ 0.01
V1 S	Weight copy	4.01 $\pm$ 0.06	1.51 $\pm$ 0.02
V1 H/mse	Random	3.80 $\pm$ 0.11	1.54 $\pm$ 0.02
V1 S/mse	Random	3.85 $\pm$ 0.15	1.54 $\pm$ 0.03
V1 H/S/mse	Random	3.61 $\pm$ 0.01	1.55 $\pm$ 0.03
V2 H	Random	4.20 $\pm$ 0.05	1.62 $\pm$ 0.01
V2 S	Random	3.76 $\pm$ 0.09	1.55 $\pm$ 0.01
V2 H/mse	Random	3.83 $\pm$ 0.01	1.57 $\pm$ 0.01
V2 S/mse	Random	3.47 $\pm$ 0.12	1.48 $\pm$ 0.04
V2 H/S/mse	Random	3.45 $\pm$ 0.03	1.51 $\pm$ 0.03
V2 H/S/mse/ $S_{1}$	Random	3.40 $\pm$ 0.02	1.49 $\pm$ 0.01
V2 H/S/mse/ $S_{2}$	Random	3.53 $\pm$ 0.03	1.43 $\pm$ 0.02

Table 5: Performance on EnCodec model. w.f.: weight factor

Method	FAD	KL	#params
Teacher EnCodec	3.84	1.15	28.4M
Tiny EnCodec FT	3.21	0.88	7.43M
Distilled EnCodec	3.17	0.89	7.43M
Distilled EnCodec (w.f. = 0.75)	2.87	0.88	7.43M
Distilled EnCodec (w.f. = 1.25)	3.15	0.89	7.43M

4.4 TinyTTM performance

We combined the best performing TTM components from the previous sections to create TinyTTM. Table 6 compares the two candidate teacher models and the best performing TinyTTM models in terms of FAD, KL and total parameter count. The compressed model trained with sampling strategy $S_{2}$ outperforms the teacher model in terms of FAD and KL, but does not reach the performance of the fine-tuned teacher. The size compression is up to $\times 6.25$ with a $\times 2.8$ latency reduction (footnote 5: comparison using a single Nvidia A10 GPU). Although TinyTTM achieves a low FAD and KL, some of the generated tracks have noticeable issues: poor beat synchronization, and singing voice that is missing or does not sound like a voice. This may be due to the low audio diversity in MusicBench, as this dataset is created from only 5k tracks.

Table 6: Performance comparison (for 3 runs with different seeds) between the proposed TinyTTM model and MusicGen-Small on the MusicBench test set A.

Model/Method	FAD	KL	#params
Teacher [1]	3.85 $\pm 0.11$	1.33 $\pm 0.01$	557.6M
Teacher FT	2.73 $\pm 0.05$	1.20 $\pm 0.01$	557.6M
V2 H/S/mse/ $S_{1}$	3.85 $\pm 0.06$	1.39 $\pm 0.02$	89.2M
V2 H/S/mse/ $S_{2}$	3.66 $\pm 0.14$	1.32 $\pm 0.01$	89.2M
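The totals in Table 6 can be cross-checked against the per-component sizes reported in Tables 1, 2 and 5. The sketch below assumes that TinyTTM combines the four-layer fine-tuned T5-tiny encoder, LM Variant 2 and the distilled EnCodec; the component sums are consistent with that reading, but the mapping is an assumption.

```python
# Parameter counts in millions, copied from Tables 1, 2 and 5.
teacher_components = {
    "T5-base encoder": 109.6,
    "MusicGen-Small LM": 419.6,
    "Teacher EnCodec": 28.4,
}
tiny_components = {
    "T5-tiny FT encoder": 11.3,   # assumed: four-layer fine-tuned variant
    "LM Variant 2": 70.5,
    "Distilled EnCodec": 7.43,
}

teacher_total = sum(teacher_components.values())
tiny_total = sum(tiny_components.values())
compression = teacher_total / tiny_total

print(f"Teacher: {teacher_total:.1f}M, TinyTTM: {tiny_total:.1f}M, "
      f"compression: x{compression:.2f}")
for name, size in tiny_components.items():
    print(f"{name}: {100.0 * size / tiny_total:.1f}% of TinyTTM")
```

The breakdown shows that the language model remains the dominant component after compression, so further size reductions would most likely have to come from the LM.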

5 Conclusions

We have presented TinyTTM, a TTM model that operates within a parameter-constrained framework. We highlighted the limitations of employing only ground truth labels for training, as outlined in [1], and discussed the enhancements achieved through KD techniques applied to each of the TTM components. With 89.2M parameters we are able to achieve a FAD of 3.66 and a KL of 1.32, which is lower than the model in [1], but not as low as its fine-tuned counterpart. Future directions include extending the exploration of KD techniques to broader and more diverse datasets to address the issues observed when training with MusicBench (such as beat synchronization or voice distortion), thus reducing the performance gap between TinyTTM and the fine-tuned MusicGen-Small.

References

[1] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” in NeurIPS 2023, 2023.
[2] K. Chen, Y. Wu, H. Liu, M. Nezhurina, T. Berg-Kirkpatrick, and S. Dubnov, “MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies,” in ICASSP, 2024.
[3] Q. Huang, A. Jansen, J. Lee, R. Ganti, J. Y. Li, and D. P. W. Ellis, “MuLan: A joint embedding of music audio and natural language,” in ISMIR, 2022, pp. 559–566.
[4] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP, 2023, pp. 1–5.
[5] A. Agostinelli, T. I. Denk, Z. Borsos, J. H. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. H. Frank, “MusicLM: Generating music from text,” 2023.
[6] Z. Evans, C. Carr, J. Taylor, S. H. Hawley, and J. Pons, “Fast timing-conditioned latent audio diffusion,” arXiv, vol. abs/2402.04825, 2024.
[7] H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. P. Mandic, W. Wang, and M. D. Plumbley, “AudioLDM: Text-to-audio generation with latent diffusion models,” vol. 202, pp. 21450–21474, 2023.
[8] H. Liu, Q. Tian, Y. Yuan, X. Liu, X. Mei, Q. Kong, Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley, “AudioLDM 2: Learning holistic audio generation with self-supervised pretraining,” arXiv, vol. abs/2308.05734, 2023.
[9] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020.
[10] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, “Scaling instruction-finetuned language models,” arXiv, vol. abs/2210.11416, 2022.
[11] A. Ziv, I. Gat, G. L. Lan, T. Remez, F. Kreuk, A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “Masked audio generation using a single non-autoregressive transformer,” arXiv, vol. abs/2401.04577, 2024.
[12] J. Melechovsky, Z. Guo, D. Ghosal, N. Majumder, D. Herremans, and S. Poria, “Mustango: Toward controllable text-to-music generation,” arXiv, vol. abs/2311.08355, 2023.
[13] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, “Detection and Classification of Acoustic Scenes and Events,” IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733–1746, 2015.
[14] Y. Liu, H. Sun, G. Chen, Q. Wang, Z. Zhao, X. Lu, and L. Wang, “Multi-level knowledge distillation for speech emotion recognition in noisy conditions,” arXiv, vol. abs/2312.13556, 2023.
[15] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “TinyBERT: Distilling BERT for natural language understanding,” in EMNLP, 2020, pp. 4163–4174.
[16] H. Shao, W. Wang, B. Liu, X. Gong, H. Wang, and Y. Qian, “Whisper-KDQ: A lightweight whisper via guided knowledge distillation and quantization for efficient ASR,” arXiv, vol. abs/2305.10788, 2023.
[17] S. Sun, Y. Cheng, Z. Gan, and J. Liu, “Patient knowledge distillation for bert model compression,” in EMNLP, 2019, pp. 4323–4332.
[18] W. Su, X. Chen, S. Feng, J. Liu, W. Liu, Y. Sun, H. Tian, H. Wu, and H. Wang, “ERNIE-Tiny: A progressive distillation framework for pretrained transformer compression,” arXiv, vol. abs/2106.02241, 2021.
[19] C. Lu, J. Zhang, Y. Chu, Z. Chen, J. Zhou, F. Wu, H. Chen, and H. Yang, “Knowledge distillation of transformer-based language models revisited,” arXiv, vol. abs/2206.14366, 2022.
[20] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” arXiv, vol. abs/1910.01108, 2019. [Online]. Available: https://arxiv.org/abs/1910.01108
[21] H. Chang, S. Yang, and H. Lee, “Distilhubert: Speech representation learning by layer-wise distillation of hidden-unit bert,” pp. 7087–7091, 2022. [Online]. Available: https://doi.org/10.1109/ICASSP43922.2022.9747490
[22] X. Lv, P. Zhang, S. Li, G. Gan, and Y. Sun, “Lightformer: Light-weight transformer using svd-based weight transfer and parameter sharing,” in ACL, 2023, pp. 10 323–10 335.
[23] Y. Tay, M. Dehghani, J. Rao, W. Fedus, S. Abnar, H. W. Chung, S. Narang, D. Yogatama, A. Vaswani, and D. Metzler, “Scale efficiently: Insights from pre-training and fine-tuning transformers,” arXiv, vol. abs/2109.10686, 2021.
[24] S. Fort, H. Hu, and B. Lakshminarayanan, “Deep ensembles: A loss landscape perspective,” arXiv, vol. abs/1912.02757, 2019.
[25] H. Shimodaira, “Improving predictive inference under covariate shift by weighting the log-likelihood function,” Journal of Statistical Planning and Inference, vol. 90, pp. 227–244, 2000.
[26] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann, “Self-paced curriculum learning,” in AAAI, 2015, pp. 2694–2700.
[27] J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng, “Meta-Weight-Net: Learning an explicit mapping for sample weighting,” in NeurIPS, 2019, pp. 1917–1928.
[28] T. Fang, N. Lu, G. Niu, and M. Sugiyama, “Rethinking importance weighting for deep learning under distribution shift,” in NeurIPS, 2020.
[29] J. Byrd and Z. C. Lipton, “What is the effect of importance weighting in deep learning?” in ICML, vol. 97, 2019, pp. 872–881.
[30] N. A. Smith and R. W. Tromble, “Sampling uniformly from the unit simplex,” Department of Computer Science / Center for Language and Speech Processing, Johns Hopkins University, 8 2004.
[31] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv, vol. abs/2210.13438, 2022.
[32] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” in Interspeech, 2019, pp. 2350–2354.
[33] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. W. Wilson, “CNN architectures for large-scale audio classification,” in ICASSP, 2017, pp. 131–135.
[34] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 28, pp. 2880–2894, 2020.
[35] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de Las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mixtral of experts,” arXiv, vol. abs/2401.04088, 2024.