On the effect of data-augmentation on local embedding properties in the contrastive learning of music audio representations

Abstract

Audio embeddings are crucial tools in understanding large catalogs of music. Typically, embeddings are evaluated on the basis of the performance they provide in a wide range of downstream tasks. However, few studies have investigated the local properties of the embedding spaces themselves, which are important in the nearest neighbor algorithms commonly used in music search and recommendation. In this work we show that when learning audio representations on music datasets via contrastive learning, musical properties that are typically homogeneous within a track (e.g., key and tempo) are reflected in the locality of neighborhoods in the resulting embedding space. By applying appropriate data augmentation strategies, the localization of such properties can be reduced while the localization of other attributes is increased. For example, the locality of features such as pitch and tempo, which are less relevant to non-expert listeners, may be mitigated while improving the locality of more salient features such as genre and mood, achieving state-of-the-art performance in nearest neighbor retrieval accuracy. Similarly, we show that the optimal selection of data augmentation strategies for contrastive learning of music audio embeddings is dependent on the downstream task, highlighting this as an important embedding design decision.

Index Terms— Music audio embeddings, data augmentation

1 Introduction

1.1 Overview

For industrial-scale music catalogs, audio embeddings are crucial in estimating perceptual and musical similarity between audio waveforms, and further, categorizing them into any taxonomy of human-readable labels. The dimensionality reduction from audio waveform to embedding ensures tasks may be addressed efficiently in terms of downstream model complexity, and pairwise embedding comparison. The generality of embeddings provides further computational efficiency in that large music catalogs may be analyzed just once, and thereafter adapted to any number of labeling tasks.

Recently, benefiting from the ability to learn from large amounts of unlabelled data, self-supervised contrastive audio embeddings have been shown to establish new state-of-the-art (SotA) results on a number of downstream tasks. Many use little to no data augmentation. COLA [1] established a new SotA on tasks concerning speech, music and environmental noise, by sampling positive pairs as distinct samples from audio regions of 10 seconds in length. [2], which employed only mix-up augmentation [3], further improved on many of these results. Other contrastive music embedding models are also trained without data-augmentation [4, 5].

In many cases data augmentation may not be necessary to increase the total coverage of music audio properties (e.g., pitch, tempo, production, etc.) in the dataset, as typically the datasets employed are large enough to include this variability naturally. However, this variability may not be seen within contrastive pairs. For example, the strategy of sampling pairs from local regions (e.g., $\pm 10$ s) of audio waveforms employed in these previous works likely causes low variability of certain properties such as key, tempo and equalization, within contrastive positive pairs. Due to this intra-pair homogeneity, we hypothesize that the local regions of the resulting embedding space will be similarly homogeneous. Yet, little research (e.g., [6]) has investigated the local properties of audio embedding spaces, and how audio characteristics are reflected therein. While performance on certain tasks such as pitch and key estimation proves the existence of pitch information within contrastively learned embeddings [7, 2], this might not necessarily guarantee the reliable emergence of the information within a local vicinity in the embedding space. The locality of this information in that space is important because such local properties are directly exposed in approximate nearest neighbor (ANN) algorithms which play important roles in efficient audio file identification, search, and recommendation, where the nearest neighbors of an audio query will have similarities determined by the local properties of the embedding space [8].

Local organization or manipulation of an embedding space by tempo [9], pitch, or key may be desirable, e.g., for professional music production or DJ applications where they are important. However, it may be problematic for consumer music applications where pitch, key and tempo can be less relevant to listeners than properties such as genre and instrumentation, that are in many cases only loosely correlated with the former. When the organization of embedding spaces by such properties is undesirable, we hypothesize that data augmentation strategies that increase the intra-pair variability of key, tempo and equalization, can reduce the sensitivity of embeddings to such properties relative to others, e.g., genre and instrumentation. Thus we expect the resulting embeddings to be more useful in nearest neighbor and labeling tasks concerning the latter.

1.2 Related Work

Various approximations of pitch shifting [10, 11, 12, 13, 14, 15, 16, 17], time stretching [16, 13, 12], and equalization [10, 11, 15] have been employed in contrastive learning. We note that time-stretching in the learning of contrastive music representations is particularly rare, with only [16, 12] employing this augmentation, and no results specifically demonstrating its efficacy. Furthermore, few studies investigate the efficacy of audio music embeddings in temporal downstream tasks such as beat/tempo estimation [18]. However, we find the effect of time-stretching augmentation to be particularly salient in this research.

In many studies data augmentation strategies are chosen based on their computational efficiency, and their suitability for the media at hand (e.g., for music, speech or environmental noise). Often data augmentation is decided on this basis prior to experimentation and left unchanged, although some studies performed ablation demonstrating some effects of data augmentation. For example, [10] studied the effect of individual augmentations on the Magnatagatune dataset [19], highlighting equalization as important in the contrastive learning of music embeddings from time domain audio, for this task. [11] found pitch shifting to be particularly important in the learning of speech representations. [14] studied the effect of pairs of augmentations on the Audioset dataset highlighting a combination of pitch shifting and mix-up to be effective.

While this work investigates the effect of augmentation on local embedding space properties, other studies have proposed designing supervised embedding spaces by training subspaces on specific concepts (e.g., genre, mood, era, instruments) via disentangled metric learning [20, 21]. These subspaces may be weighted to emphasize certain properties in downstream tasks, however, no results were provided as to the independence of each of these subspaces.

Due to the generality of audio embeddings and the number of possible downstream applications, this wide body of literature provides a patchwork of approaches to designing them for various downstream tasks. However, there are two studies we see missing from the literature. First, while there are some results demonstrating the effects of data augmentation on particular downstream tasks there are no results demonstrating the effect of data augmentation on local properties of the embedding space which are important for efficient ANN solutions employed in search and recommendation. Secondly, ablation studies on data augmentation for audio embeddings are based on the results from a single task, however, we expect data augmentation strategies to have task-dependent effects. In this work we address these areas by first demonstrating the effect of data augmentation on local embedding properties, and secondly, we show the subsequent effect of various data augmentation methods on downstream nearest neighbor and labelling tasks. We highlight time-stretching as an important and under-explored augmentation and achieve SotA performance on a number of tasks.

2 Methodology

Here we study the open-source Musicset-Unsupervised Large Embedding (MULE) model of [7] which employs a convolutional architecture first proposed in [2]. This provides a reproducible and performant baseline for self-supervised contrastive learning of music audio. It is trained via a familiar strategy of sampling contrastive pairs of mel-spectrograms that are located locally ( $\pm 5$ s) in music audio timelines. A similar strategy is common in the literature [10, 2, 7, 12, 4, 5, 1, 22] and has been shown to establish a new SotA in self-supervised representations for a number of tasks [7, 2]. NT-Xent loss [23] is employed, where the normalized temperature-scaled cross-entropy of the cosine distance between positive pairs is minimized relative to all other examples in a given batch, which form the negative samples. Based on this loss, it is expected that characteristics that are typically common in positive pairs would be embedded locally in an embedding space with respect to cosine distance. Here we investigate three such properties: tempo, key and equalization. We employ a pipeline implementing the related data augmentation strategies of time-stretching, pitch-shifting and equalization, shown in Fig. 1. Specifically, we augment mel spectrograms of the form,

$$X[u,m]=\log_{10}\left(\sum\limits_{k=0}^{k=K}S_{u}[k]\left|\sum\limits_{n=0}^{n=N}{e^{-\frac{2\pi nkj}{K}}w[n]x[ml-n]}\right|\right). \tag{1}$$

Here $S_{u}[k]$ for $0\leq u<U$ is a mel window at index $u$ according to HTK mel scaling [24], $w[n]$ is a windowing function of length $N$ , $l$ is the hop size, and $K$ is the DFT size. Parameters used here are identical to [7].
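As an illustration of the NT-Xent objective referenced at the start of this section, the following is a minimal sketch in plain Python, assuming the SimCLR-style formulation of [23]. The function name `nt_xent_loss`, the list-based data layout, and the default temperature of 0.1 are illustrative assumptions and are not taken from [7].

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent_loss(anchors, positives, temperature=0.1):
    """NT-Xent loss for a batch of positive pairs (anchors[i], positives[i]).

    Every other embedding in the batch acts as a negative for a given example.
    The temperature default is an illustrative assumption.
    """
    z = [_normalize(v) for v in anchors + positives]
    n = len(anchors)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the positive partner of example i
        # Temperature-scaled cosine similarities to all other examples.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(2 * n)
            if j != i
        }
        log_denominator = math.log(sum(math.exp(s) for s in logits.values()))
        total += log_denominator - logits[pos]
    return total / (2 * n)
```

Because the loss only rewards high cosine similarity within positive pairs, any property that is nearly constant within a pair (such as tempo or key in unaugmented local crops) is free to shape the local geometry of the space.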

Fig. 1: Sampling and augmentation pipeline diagram.

Time stretching augmentation (TS), defined as $X_{TS}[u,m]=h_{TS}(X[u,m];\tau)$ , is applied by resampling $X[u,m]$ using cubic spline interpolation at points $t=\tau m$ . We sample $\tau$ according to $\mathbf{\tau}\sim 1/(\tau\log{(1.5/0.75)})$ in range $[0.75,1.5]$ . To allow for $\tau>1.0$ , $X[u,m]$ is sampled with a context of $4.5$ seconds and then truncated to $3$ seconds following augmentation. While time-stretching is seldom applied in music representation learning [16, 12], we note that time-warping, in which a randomly sampled $m$ is shifted in time by resampling, was employed in [25, 16]. Further, one effect of the Random Resized Crop (RRC) augmentation of [13, 26, 27, 28] could be considered a time stretching augmentation, albeit coupled with a similar augmentation on the frequency axis.
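The sampling and resampling steps of the TS augmentation can be sketched as follows. This is a simplified illustration rather than the pipeline of Fig. 1: linear interpolation stands in for the cubic spline, and the function names and list-of-frames layout are assumptions. The density $1/(\tau\log(b/a))$ on $[a,b]$ is a log-uniform distribution, which can be sampled as $a(b/a)^{v}$ with $v$ uniform on $[0,1)$.

```python
import math
import random


def sample_log_uniform(low, high, rng=random):
    """Draw x in [low, high) with density 1 / (x * log(high / low))."""
    return low * (high / low) ** rng.random()


def time_stretch(frames, tau, out_len):
    """Resample a sequence of spectrogram frames at positions t = tau * m.

    `frames` is a list of frames, each a list of mel-bin values.
    Linear interpolation stands in for the cubic spline used in the text.
    """
    last = len(frames) - 1
    stretched = []
    for m in range(out_len):
        t = tau * m
        if t > last:
            raise ValueError("not enough context frames for this tau")
        lo = math.floor(t)
        hi = min(lo + 1, last)
        frac = t - lo
        stretched.append(
            [(1.0 - frac) * a + frac * b for a, b in zip(frames[lo], frames[hi])]
        )
    return stretched
```

Note that the largest factor, $\tau=1.5$, reads $1.5\times 3=4.5$ seconds of input to produce $3$ seconds of output, which matches the context length stated above.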

Pitch shifting augmentation (PS), defined as $X_{PS}[u,m]=h_{PS}(X[u,m];\mu)$ , is applied via cubic spline interpolation at points:

$$f=S_{U}\log_{10}(1+\mu(10^{u/S_{U}}-1)), \tag{2}$$

with $S_{U}=U/\log_{10}(1+R/700)$ where $R$ is the sampling rate in Hz. This is equivalent to interpolating at points $\hat{f}=\mu\hat{m}$ on a linear frequency scale then translating to HTK mel scaling. $\mu$ is distributed as $\mathbf{\mu}\sim 1/(\mu\log{(1.335/0.749)})$ in the range $[0.749,1.335]$ , corresponding to a relative decrease / increase of up to 5 semitones. When $\mu<1.0$ , interpolated frequency bins are set to $0.0$ for $u>\mu U$ . We note that the pitch augmentation here has the imperfect effect of changing the bandwidth of harmonic partials (future work may use constant-Q features, thereby removing this artifact), yet is computationally efficient and a more accurate pitch shift operation than the translation operation of [14, 17], or the RRC augmentation in [13, 26, 27, 28].
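As a quick check on the pitch-shift warping, the sketch below evaluates the interpolation points $f=S_{U}\log_{10}(1+\mu(10^{u/S_{U}}-1))$ with $S_{U}=U/\log_{10}(1+R/700)$ as defined in the text, and confirms that the range limits correspond to five equal-tempered semitones. The function names and any specific values of $U$ and $R$ passed in are illustrative assumptions, since the actual parameters are those of [7].

```python
import math


def pitch_shift_points(num_bins, sample_rate, mu):
    """Return the mel-bin positions f(u) for u = 0 .. num_bins - 1."""
    s_u = num_bins / math.log10(1.0 + sample_rate / 700.0)
    return [
        s_u * math.log10(1.0 + mu * (10.0 ** (u / s_u) - 1.0))
        for u in range(num_bins)
    ]


def semitones_to_ratio(semitones):
    """Frequency ratio of an equal-tempered shift by `semitones`."""
    return 2.0 ** (semitones / 12.0)
```

With $\mu=1$ the mapping reduces to $f=u$ (no shift), and $2^{\pm 5/12}\approx 1.335$ and $0.749$, which agrees with the sampling range above.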

Equalization augmentation (EQ) is achieved by applying randomly sampled lowpass and highpass filters to features,

$$X_{EQ}[u,m]=X[u,m]+\log_{10}\left(\sum\limits_{k=0}^{k=K}S_{u}[k]B_{f,c}[k]\right), \tag{3}$$

where $B_{f,c}[k]$ refers to a third-order Butterworth magnitude response, with corner frequency $f$ and highpass / lowpass selection $c\in\left\{0,1\right\}$ . We follow [10], where corner frequencies are uniformly sampled from the range 2.2 kHz to 4 kHz for lowpass, and the range 200 Hz to 1.2 kHz for highpass filters. For each sample, we apply equal probabilities of highpass, lowpass and no filter.
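A minimal sketch of the filter sampling and magnitude response is given below. It assumes the standard analog Butterworth prototype, $|H(f)|=1/\sqrt{1+(f/f_{c})^{2n}}$ for lowpass and the reciprocal frequency ratio for highpass; the text does not state whether an analog or digital design is used, so this choice, along with the function names, is an assumption.

```python
import math
import random


def butterworth_magnitude(freq_hz, corner_hz, highpass, order=3):
    """Magnitude response of an analog Butterworth filter at `freq_hz`."""
    if highpass:
        if freq_hz == 0.0:
            return 0.0
        ratio = corner_hz / freq_hz
    else:
        ratio = freq_hz / corner_hz
    return 1.0 / math.sqrt(1.0 + ratio ** (2 * order))


def sample_eq_filter(rng=random):
    """Pick highpass, lowpass, or no filter with equal probability.

    Returns None for no filter, else a (corner_hz, highpass) tuple.
    """
    choice = rng.choice(["none", "lowpass", "highpass"])
    if choice == "none":
        return None
    if choice == "lowpass":
        return rng.uniform(2200.0, 4000.0), False
    return rng.uniform(200.0, 1200.0), True
```

Evaluating `butterworth_magnitude` at each DFT bin frequency would give the weights $B_{f,c}[k]$ that are pooled through the mel windows in Eq. (3).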

In addition to TS, PS, EQ, we investigate the baseline RRC, and the combinations TSPSEQ and TSPS according to the order of operations in Fig. 1. The latter is similar to RRC, but with a more accurate pitch-shift operation. For each augmentation strategy, or combination thereof, we fine-tune all MULE layers using batches of $960$ example pairs, sampled and augmented in real-time from full track length timelines of a dataset of $1.7$ M music tracks. In all cases we train for $100$ k steps with a learning rate linearly ramping from $0$ to $0.001$ over $5$ k steps followed by a cosine decay to $0$ over the remaining $95$ k steps. All other hyperparameters are identical to [7].
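The learning rate schedule described above can be written compactly as below. The half-cosine form of the decay is the common convention and is assumed here, as the text does not give the exact expression; the function name is illustrative.

```python
import math


def learning_rate(step, peak=1e-3, warmup_steps=5_000, total_steps=100_000):
    """Linear warm-up from 0 to `peak`, then cosine decay to 0."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

Under this form the rate reaches $0.001$ at step $5$ k, falls to half of that midway through the decay (step $52.5$ k), and reaches $0$ at step $100$ k.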

3 Results

We consider three key categories of datasets – tempo, pitch and mixed labelling. For tempo, we consider the same collection of datasets in [29] plus the Harmonix dataset [30]. We hold out Gtzan [31], Giantsteps-Tempo [32, 33], and ACM-Mirum [34] as individual test sets, using the most recent annotations for each. Collectively, we refer to these splits as the AllTempo dataset. Other datasets employed include NSynth pitch and instrument datasets (NSynth_P, NSynth_I) [35], Giantsteps Key (GS_Key) [33], Gtzan [36], Magnatagatune (MTT) [19], and the Million Song Dataset (MSD) [37] (employing the commonly used variation of the 50 most common labels; see https://github.com/jongpillee/music_dataset_split). NSynth_P and GS_Key refer specifically to pitch-related information. In all cases we use the same partitions and filtering as in [7].

3.1 Local Embedding Properties

To demonstrate the effects of tempo / pitch augmentation on embedding spaces, we synthetically time-stretch all tracks in all AllTempo test partitions, and pitch-shift all tracks in the GS_Key dataset using Sox (https://sox.sourceforge.net/). For each stretch / shift factor we measure the cosine distance between the track-average embeddings of the modified and unmodified versions. The mean and interquartile range across the respective datasets is shown in Fig. 2. For models trained without TS augmentation we observe musically sensical troughs in cosine distance at half and double time. Similarly, we observe that models trained without PS augmentation show musically sensical troughs in cosine distance at perfect 4th, perfect 5th and octave intervals, and peaks at minor 2nd and diminished 5th intervals. For each property the corresponding data augmentation removes both these musical properties, and mitigates the movement of the modified embeddings away from their unmodified version. In both cases PS augmentation increases embedding sensitivity to tempo, whilst TS augmentation increases embedding sensitivity to pitch. An important consequence is that these local embedding properties are directly reflected in nearest neighbor algorithms where, for two candidate tracks that are similar to an embedding query, the one that has a tempo very similar to the query is more likely to be selected. This may be undesirable in cases where the distance imposed by a difference in tempo or key outweighs the similarity of a property such as genre or mood, which may be more relevant to the query’s intent.
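The distance measured in this experiment can be sketched as follows, assuming each track is represented by a list of per-segment embedding vectors. The function names and data layout are illustrative assumptions.

```python
import math


def track_average(embeddings):
    """Average a list of per-segment embeddings into one track embedding."""
    count = len(embeddings)
    return [sum(dim) / count for dim in zip(*embeddings)]


def cosine_distance(a, b):
    """One minus the cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

For a given stretch or shift factor, the reported quantity would then be `cosine_distance(track_average(original), track_average(modified))`, summarized across tracks by its mean and interquartile range.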

To validate the change in sensitivity of embedding spaces to pitch and tempo features of unmodified audio, we consider the average variability of tempo, tags and key in the neighborhoods of each track over all test partitions of the AllTempo, MSD, and the entirety of the GS_Key datasets. For tempo we collect the K-nearest neighbors in the cosine embedding space and calculate the root-mean-minimum-square (RMMS) distance between the closest tempo octave of $[1/3,1/2,1,2,3]$ of the seed track’s tempo and each of its K neighbors. To investigate the locality of key in a nearest neighbor setting, we look at the precision of retrieving key labels from the K-nearest neighbors of each seed track relative to its own tags. In all cases, metrics are averaged over all possible seeds in the dataset.
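One reading of the RMMS tempo distance is sketched below: for each neighbor, take the minimum squared difference between its tempo and the seed tempo scaled by each of the listed factors, average over the K neighbors, and take the square root. This interpretation and the function name are assumptions, as the text does not give a closed-form definition.

```python
import math

TEMPO_FACTORS = (1 / 3, 1 / 2, 1, 2, 3)


def rmms_tempo_distance(seed_bpm, neighbor_bpms, factors=TEMPO_FACTORS):
    """Root-mean-minimum-square tempo distance for one seed track."""
    min_squares = [
        min((factor * seed_bpm - bpm) ** 2 for factor in factors)
        for bpm in neighbor_bpms
    ]
    return math.sqrt(sum(min_squares) / len(min_squares))
```

For example, a seed at 120 BPM with neighbors at 60, 120 and 240 BPM scores 0 under this reading, since every neighbor sits on a tempo octave of the seed, whereas a single neighbor at 130 BPM scores 10.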

The results of these experiments are shown in Fig. 3. In Fig. 3(a) we see that all augmentations involving time stretching do indeed increase the variability of tempo in local neighborhoods of the embedding space, whereas PS augmentation slightly reduces this. Intuitively this is as we expect, and agrees with Fig. 2; if embeddings are less sensitive to one feature we expect them to be more sensitive to the remaining features due to reduced competition with the now-desensitized feature. Conversely, in Fig. 3(b), we see that embeddings trained with pitch augmentation notably decrease the precision (and hence consistency) of neighboring embedding keys relative to a query embedding, whereas TS augmentation notably increases this precision. It is promising to see tag precision of neighboring embeddings increases when applying any form of time-stretching augmentation, highlighting this augmentation to be an important factor in improving the local organization of embedding spaces by tags such as genre and key. We note that RRC augmentation is not as effective at increasing the locality of tags in the embedding space. In all cases we observe EQ augmentation to have little effect.

3.2 Nearest Neighbor Retrieval

The intention of embedding spaces that are desensitized to pitch, tempo and equalization is to improve the locality of properties that are relatively impartial to these, for downstream uses such as ANN retrieval. To evaluate, we follow the methodology in [20], over the same MSD test partition. We compute a tag retrieval metric by evaluating for each tag, the percentage of tracks with that tag that have it in common with any of their k-nearest neighbors. This is averaged over all tracks and tags. We also compute the precision at k between each seed track’s tag and the retrieved tags from its k-nearest neighbors, averaged over all seed tracks and tags. Results in Table 1 show MULE-based models outperform previous work; however, those tuned with TS augmentation perform significantly better than those without. Corroborating results in Section 3.1, RRC augmentation improves the grouping of embeddings by tags in local neighborhoods, although it is not as effective as TS augmentation.
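The two metrics can be sketched as follows. This is one possible reading, in which each metric is first averaged over the tracks carrying a tag and then over tags; the exact averaging used in [20] may differ, and the function name and data layout are assumptions.

```python
def neighbor_tag_metrics(tags, neighbors, k):
    """Return (precision_at_k, tag_retrieval_at_k), each averaged over tags.

    `tags[i]` is the set of tags of track i; `neighbors[i]` lists the indices
    of its nearest neighbors, closest first.
    """
    vocabulary = sorted(set().union(*tags))
    precisions, retrievals = [], []
    for tag in vocabulary:
        seeds = [i for i, seed_tags in enumerate(tags) if tag in seed_tags]
        # For each seed, flag which of its k neighbors share the tag.
        hits_per_seed = [
            [tag in tags[j] for j in neighbors[i][:k]] for i in seeds
        ]
        precisions.append(
            sum(sum(hits) / k for hits in hits_per_seed) / len(seeds)
        )
        retrievals.append(
            sum(any(hits) for hits in hits_per_seed) / len(seeds)
        )
    return (
        sum(precisions) / len(vocabulary),
        sum(retrievals) / len(vocabulary),
    )
```

Under this reading the two metrics coincide at $k=1$, since a single neighbor either shares the tag or does not.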

	Precision			Tag Retrieval
	$k$ =2	$k$ =4	$k$ =8	$k$ =1	$k$ =2	$k$ =4	$k$ =8
MULE	44.1	40.1	36.3	47.6	59.7	71.0	80.7
TS	48.0	44.1	39.9	51.9	63.3	73.9	82.5
PS	44.1	40.2	36.6	47.8	59.6	71.1	80.7
EQ	44.3	40.5	36.7	48.0	59.8	71.3	81.0
TSPS	48.1	44.1	40.0	51.8	63.2	73.7	82.4
TSPSEQ	48.1	44.1	40.1	51.6	63.4	73.8	82.5
RRC	46.6	42.9	39.4	50.3	61.8	72.8	81.9
[20]	-	-	-	45.0	58.5	71.0	80.9

Table 1: Nearest neighbor content based retrieval performance for different augmentations. Metric definitions are equivalent for $k=1$.

3.3 Music Labelling

Here, we look at the performance of embeddings over pitch, tempo and other labeling tasks to both confirm the loss of pitch / tempo information in embeddings trained with PS / TS augmentation, and the retention of other information such as genre and instrumentation.

For the AllTempo dataset we train a multi-layer perceptron as a 271 class classification problem for integer BPMs from 30 to 300 BPM. This probe has a single layer of 512 neurons and 75% dropout, trained with a batch size of 256 over 20k steps. Following [38], the tempo estimate is taken as the argmax of the classifier output after smoothing by a 15 BPM Hamming window. As there are no prior results on these datasets for generic music embeddings, we include two baselines: an end-to-end tempo convolutional network [39] and a SotA bespoke tempo and beat tracking model [29]. Congruent with the results in Section 3.1, Table 2 shows TS augmentation degrades performance on tempo tasks, while PS augmentation improves it. The PS model exceeds the performance of [39] on Giantsteps and on ACM-Mirum Acc1, and approaches the performance of [29]. This is promising considering that this baseline is highly task specific, whereas MULE is a generic music embedding that achieves excellent results in many other tasks.
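The decoding step can be sketched as below. It assumes one class per integer BPM starting at 30 BPM, so that a 15 BPM window corresponds to 15 taps; the function names and the zero-padding at the edges are illustrative assumptions.

```python
import math


def hamming_window(length):
    """Symmetric Hamming window of the given length."""
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def estimate_tempo(class_probs, min_bpm=30, window_len=15):
    """Smooth per-BPM classifier outputs, then return the argmax BPM."""
    window = hamming_window(window_len)
    half = window_len // 2
    smoothed = []
    for i in range(len(class_probs)):
        acc = 0.0
        for n, w in enumerate(window):
            j = i + n - half  # input index under window tap n
            if 0 <= j < len(class_probs):
                acc += w * class_probs[j]
        smoothed.append(acc)
    best = max(range(len(smoothed)), key=smoothed.__getitem__)
    return min_bpm + best
```

Smoothing favors tempi supported by probability mass in adjacent classes over an isolated spike, which makes the estimate less sensitive to single-class outliers.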

For all other tasks we use the same probe training parameters as [7]. Table 3 shows that despite the heavy influence of tempo in the local organization of embedding spaces observed in Section 3.1, models with PS augmentation yield the greatest improvement on non-pitch dependent tasks (NSynth_I, Gtzan and MTT). Of these tasks, combinations of augmentation strategies achieve SotA on MTT. Interestingly, while RRC augmentation degrades performance on MTT, we see SotA performance on the smaller Gtzan dataset.

	Gtzan		Giantsteps		ACM-Mirum
	Acc1	Acc2	Acc1	Acc2	Acc1	Acc2
MULE	74.1	90.5	85.5	98.2	81.2	95.8
TS	54.6	66.2	71.0	84.1	66.2	76.9
PS	76.2	91.1	86.7	98.2	82.3	96.4
EQ	73.2	90.2	77.0	98.0	81.1	95.6
TSPS	64.1	78.0	78.8	92.3	76.5	87.7
TSPSEQ	60.4	75.5	56.9	75.0	68.7	79.9
RRC	54.0	67.1	49.2	69.7	63.0	73.7
[39]	76.9	92.6	82.1	97.1	78.1	97.6
[29]	83.0	95.0	87.0	96.5	84.1	99.0

Table 2: Acc 1 and Acc 2 metrics [40] on AllTempo test partitions.
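For reference, the sketch below follows the commonly used definitions of these metrics after [40]: Acc1 counts estimates within 4% of the annotated tempo, and Acc2 additionally accepts estimates within 4% of the annotation scaled by 2, 3, 1/2 or 1/3. The function name is an assumption.

```python
def tempo_accuracies(estimates, references, tolerance=0.04):
    """Return (Acc1, Acc2) as percentages over paired tempo estimates."""

    def within(estimate, target):
        return abs(estimate - target) <= tolerance * target

    acc1 = sum(within(e, r) for e, r in zip(estimates, references))
    acc2 = sum(
        any(within(e, factor * r) for factor in (1, 2, 3, 1 / 2, 1 / 3))
        for e, r in zip(estimates, references)
    )
    return 100.0 * acc1 / len(references), 100.0 * acc2 / len(references)
```

The gap between Acc1 and Acc2 therefore indicates how many errors are octave errors rather than unrelated tempo estimates.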

	GS_Key	NSynth_P	NSynth_I	Gtzan	MTT
	W. Acc	Acc	Acc	Acc	mAP	ROC
MULE	66.7	89.2	74.0	73.5	40.4	91.4
TS	67.2	89.0	74.8	77.2	40.4	91.5
PS	30.2	81.2	75.0	79.7	40.5	91.5
EQ	65.9	89.5	73.6	75.5	40.5	91.4
TSPS	18.5	79.3	73.2	81.0	40.8	91.7
TSPSEQ	17.6	79.7	75.3	80.3	40.9	91.7
RRC	14.9	76.9	73.9	82.8	37.1	89.2
SS SotA	67.3	94.4	78.2	81.1	40.9	91.4
SS SotA	[18]	[18]	[2]	[16]	[16]	[7]

Table 3: Tagging performance of augmentation pipelines compared to no augmentation (MULE) and self-supervised (SS) SotA.

4 Conclusions

In this work we investigated the effect of intra-pair data augmentation on the local properties of contrastive music audio embedding spaces and showed its effect on downstream nearest neighbor / labelling tasks, with several key results. Firstly, we demonstrated that local neighborhoods in contrastively learned audio embedding spaces reflect local properties in the training data. Secondly, applying data augmentation mitigates the locality of related properties in the embedding space, while improving the locality of unrelated properties. Thirdly, the optimal selection of data augmentation strategies is task dependent, and careful selection results in SotA performance on downstream labelling / nearest neighbor tasks. Finally, we identified tempo as a key feature in the organization of contrastive music audio embedding spaces.

References

[1] A. Saeed, D. Grangier, and N. Zeghidour, “Contrastive learning of general-purpose audio representations,” in ICASSP, 2021, pp. 3875–3879.
[2] L. Wang, P. Luc, Y. Wu, A. Recasens, L. Smaira, A. Brock, A. Jaegle, J.-B. Alayrac, S. Dieleman, J. Carreira, et al., “Towards learning universal audio representations,” in ICASSP, 2022, pp. 4593–4597.
[3] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in ICLR, 2018.
[4] P. Alonso-Jiménez, X. Serra, and D. Bogdanov, “Music representation learning based on editorial metadata from discogs,” in ISMIR, 2022, pp. 825–833.
[5] P. Alonso-Jiménez, X. Favory, H. Foroughmand, G. Bourdalas, X. Serra, T. Lidy, and D. Bogdanov, “Pre-training strategies using contrastive learning and playlist information for music classification and similarity,” in ICASSP, 2023, pp. 1–5.
[6] J. Kim, J. Urbano, C. C. S. Liem, and A. Hanjalic, “Are nearby neighbors relatives? testing deep music embeddings,” Frontiers Appl. Math. Stat., vol. 5, pp. 53, 2019.
[7] M. C. McCallum, F. Korzeniowski, S. Oramas, F. Gouyon, and A. F. Ehmann, “Supervised and unsupervised learning of audio representations for music understanding,” in ISMIR, 2022, pp. 256–263.
[8] W. Min, K. Lu, and X. He, “Locality pursuit embedding,” Pattern Recognit., vol. 37, no. 4, pp. 781–788, 2004.
[9] M. C. McCallum, F. Henkel, J. Kim, S. Sandberg, and M. E. P. Davies, “Similar but faster: manipulation of tempo in music audio embeddings for tempo prediction and search,” in ICASSP, 2024.
[10] J. Spijkervet and J. A. Burgoyne, “Contrastive learning of musical representations,” in ISMIR, 2021, pp. 673–681.
[11] E. Kharitonov, M. Rivière, G. Synnaeve, L. Wolf, P.-E. Mazaré, M. Douze, and E. Dupoux, “Data augmenting contrastive learning of speech representations in the time domain,” in IEEE SLT Workshop, 2021, pp. 215–222.
[12] J. Choi, S. Jang, H. Cho, and S. Chung, “Towards proper contrastive self-supervised learning strategies for music audio representation,” in ICME, 2022, pp. 1–6.
[13] D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, and K. Kashino, “BYOL for audio: Exploring pre-trained general-purpose audio representations,” IEEE/ACM TASLP, vol. 31, pp. 137–151, 2022.
[14] L. Wang and A. v. d. Oord, “Multi-format contrastive learning of audio representations,” arXiv:2103.06508, 2021.
[15] D. Yao, Z. Zhao, S. Zhang, J. Zhu, Y. Zhu, R. Zhang, and X. He, “Contrastive learning with positive-negative frame mask for music representation,” in ACM Web Conference, 2022, pp. 2906–2915.
[16] H. Zhao, C. Zhang, B. Zhu, Z. Ma, and K. Zhang, “S3T: Self-supervised pre-training with swin transformer for music classification,” in ICASSP. IEEE, 2022, pp. 606–610.
[17] A. Jansen, M. Plakal, R. Pandya, D. P. Ellis, S. Hershey, J. Liu, R. C. Moore, and R. A. Saurous, “Unsupervised learning of semantic audio representations,” in ICASSP, 2018, pp. 126–130.
[18] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Lin, A. Ragni, E. Benetos, N. Gyenge, et al., “MERT: Acoustic music understanding model with large-scale self-supervised training,” arXiv:2306.00107, 2023.
[19] E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, “Evaluation of algorithms using games: The case of music tagging.,” in ISMIR, 2009, pp. 387–392.
[20] J. Lee, N. J. Bryan, J. Salamon, Z. Jin, and J. Nam, “Metric learning vs classification for disentangled music representation learning,” in ISMIR, 2020, pp. 439–445.
[21] J. Lee, N. J. Bryan, J. Salamon, Z. Jin, and J. Nam, “Disentangled multidimensional metric learning for music similarity,” in ICASSP, 2020, pp. 6–10.
[22] M. C. McCallum, “Unsupervised learning of deep features for music segmentation,” in ICASSP, 2019, pp. 346–350.
[23] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020, pp. 1597–1607.
[24] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, et al., “The HTK book,” Cambridge Univ. Eng. Dept., vol. 3, no. 175, pp. 12, 2002.
[25] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in ISCA, 2019, pp. 2613–2617.
[26] B. Nguyen, S. Uhlich, and F. Cardinaux, “Improving self-supervised learning for audio representations by feature diversity and decorrelation,” ICASSP, 2023.
[27] S. Ghosh, A. Seth, S. Umesh, and D. Manocha, “MAST: Multiscale audio spectrogram transformers,” ICASSP, 2022.
[28] E. Fonseca, D. Ortego, K. McGuinness, N. E. O’Connor, and X. Serra, “Unsupervised contrastive learning of sound event representations,” in ICASSP, 2021, pp. 371–375.
[29] S. Böck and M. E. P. Davies, “Deconstruct, analyse, reconstruct: How to improve tempo, beat, and downbeat estimation,” in ISMIR, 2020, pp. 574–582.
[30] O. Nieto, M. McCallum, M. E. P. Davies, A. Robertson, A. Stark, and E. Egozy, “The Harmonix set: Beats, downbeats, and functional segment annotations of western popular music,” in ISMIR, 2019, pp. 565–572.
[31] U. Marchand and G. Peeters, “Swing ratio estimation,” in DAFx, 2015, pp. 423–428.
[32] H. Schreiber and M. Müller, “A crowdsourced experiment for tempo estimation of electronic dance music.,” in ISMIR, 2018, pp. 409–415.
[33] P. Knees, Á. Faraldo Pérez, H. Boyer, R. Vogl, S. Böck, F. Hörschläger, M. Le Goff, et al., “Two data sets for tempo estimation and key detection in electronic dance music annotated from user corrections,” in ISMIR, 2015, pp. 364–370.
[34] G. Percival and G. Tzanetakis, “Streamlined tempo estimation based on autocorrelation and cross-correlation with pulses,” IEEE TASLP, vol. 22, no. 12, pp. 1765–1776, 2014.
[35] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in ICML, 2017, pp. 1068–1077.
[36] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE TSAP, vol. 10, no. 5, pp. 293–302, 2002.
[37] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere, “The million song dataset,” in ISMIR, 2011.
[38] S. Böck, M. E. P. Davies, and P. Knees, “Multi-task learning of tempo and beat: Learning one to improve the other.,” in ISMIR, 2019, pp. 486–493.
[39] H. Schreiber and M. Müller, “A single-step approach to musical tempo estimation using a convolutional neural network,” in ISMIR, 2018, pp. 100–105.
[40] F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano, “An experimental comparison of audio tempo induction algorithms,” IEEE TASLP, vol. 14, no. 5, pp. 1832–1844, 2006.