¹NAVER LABS Europe - France
²University of Edinburgh - UK

Training code: https://github.com/utter-project/fairseq
Pre-processing code: https://github.com/utter-project/mHuBERT-147-scripts
Model Weights: https://huggingface.co/utter-project/mHuBERT-147

| SSL | # Params | Monolingual ASR, normal, CER (↓) | Multilingual ASR, normal, CER (↓) | Multilingual ASR, few-shot, CER (↓) | LID, normal, ACC (↑) | Multilingual ASR + LID, normal, ACC (↑) | Multilingual ASR + LID, normal, CER (↓) | Multilingual ASR + LID, few-shot, CER (↓) | SUPERB_s (↑) |
|---|---|---|---|---|---|---|---|---|---|
| MMS-1B | 965M | 33.3 / 25.7 | 21.3 / 18.1 | 30.2 / 30.8 | 84.8 / 86.1 | 73.3 / 74.8 | 26.0 / 25.5 | 25.4 / 24.8 | 983.5 (1st) / 948.1 (2nd) |
| mHuBERT-147 | 95M | 34.2 / 26.3 | 23.6 / 22.0 | 33.2 / 32.9 | 85.3 / 91.0 | 81.4 / 90.0 | 26.2 / 22.1 | 34.9 / 33.5 | 949.8 (2nd) / 950.2 (1st) |
| MMS-300M | 317M | 33.8 / 30.5 | 28.7 / 24.0 | 36.5 / 36.5 | 62.3 / 84.3 | 71.9 / 74.3 | 31.5 / 30.0 | 30.9 / 29.2 | 824.9 (3rd) / 844.3 |
| NWHC1 | 317M | 39.5 / 30.5 | 28.9 / 21.5 | 41.4 / 38.6 | 67.1 / 87.4 | 77.1 / 90.6 | 28.8 / 21.5 | 40.3 / 38.2 | 774.4 / 876.9 (3rd) |
| NWHC2 | 317M | 39.5 / 30.5 | 29.3 / 21.6 | 42.0 / 39.3 | 64.4 / 88.1 | 77.4 / 90.6 | 28.4 / 21.8 | 41.5 / 38.8 | 759.9 / 873.3 |
| XLS-R-300M | 317M | 39.7 / 30.6 | 29.2 / 22.0 | 40.9 / 39.3 | 66.9 / 87.9 | 55.6 / 85.6 | 28.4 / 22.9 | 42.1 / 42.4 | 730.8 / 850.5 |
| WavLabLM-large-MS | 317M | 40.5 / 32.8 | 37.8 / 31.9 | 43.8 / 42.8 | 71.7 / 81.1 | 70.8 / 80.0 | 37.0 / 32.2 | 43.4 / 41.2 | 707.5 / 740.9 |

ML-SUPERB 10min/1h leaderboard. Updated SOTA scores for each metric are presented in bold. The first (1st), second (2nd) and third (3rd) best SUPERB scores are marked.

mHuBERT-147: A Compact Multilingual HuBERT Model

Abstract

We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.

1 Introduction

Self-supervised Learning (SSL) approaches for speech representation learning are the core foundation blocks of speech processing systems today. These models leverage large amounts of unlabeled speech data during training, and learn from different pretext tasks in order to build contextualized speech representations that can be leveraged for various downstream tasks. For English, several models have been proposed [38, 41, 3, 23, 8, 31], but most multilingual models available to the community are based on wav2vec 2.0 [13, 2, 40], with the only current exception being WavLabLM [10]. Meanwhile, the Hidden units BERT model (HuBERT, [23]) – where pre-training is performed in 2 or 3 iterations, and target training labels are externally obtained via k-means clustering – presents superior performance on the English SSL benchmark SUPERB [59], outperforming wav2vec 2.0. The English version of this model also shows decent cross-lingual adaptation capabilities on the multilingual benchmark ML-SUPERB [47]. Recently, HuBERT has also emerged as a popular choice for producing discrete speech units for multimodal LLMs [57, 6, 12].

However, despite recent efforts to reduce hardware costs [29, 30, 9], HuBERT’s superior performance comes with higher training costs – a characteristic that is further accentuated in multilingual settings requiring larger amounts of speech data. These cost demands arise from HuBERT’s multi-iteration training process, which contrasts with wav2vec 2.0’s single iteration training on unlabeled speech data alone. HuBERT requires high-dimensional feature extraction across the entire training dataset to generate discrete labels, along with a minimum of two model training and clustering steps, resulting in increased disk and CPU/GPU resource demands. We are aware of only small-scale multilingual HuBERT models that train using a single dataset: there is an mHuBERT trained on 3 languages [28]; and a HuBERT collection covering between 5 and 12 languages [18].

In contrast, in this work, we tackle the challenge of training the first general-purpose massively multilingual HuBERT speech representation model. This model is trained using 90,430 hours of clean open-license data across 147 languages. To reduce pre-processing costs, we down-sample large, widely used speech datasets, hypothesizing that source diversity is more important than quantity. We also propose to replace the original HuBERT clustering implementation with faiss-based clustering [17], increasing label assignment speed by 5.2 times. Finally, for training, we employ a two-level multilingual up-sampling approach that factors in both linguistic and dataset diversity to increase overall multilingual performance.

After three training iterations, and with only 95M parameters, our compact mHuBERT-147 model outperforms multilingual wav2vec 2.0 models that are not only three times larger, but also trained with almost five times more data. It reaches respectively second and first place on the 10 min and 1 h leaderboards of the popular multilingual benchmark ML-SUPERB, with state-of-the-art (SOTA) scores for three out of four language identification (LID) tasks. Across all ML-SUPERB tasks and settings, mHuBERT-147 consistently surpasses XLS-R (300 M parameters; 436 K hours) – its most comparable multilingual wav2vec 2.0 model. It also demonstrates strong competitiveness against the much larger MMS (1 B parameters; 491 K hours). Complementary few-shot ASR evaluation on FLEURS-102 shows that our model is competitive with larger models, presenting impressive robustness at a compact size (70% fewer parameters). Our findings suggest that mHuBERT-147 is a promising compact model for speech processing pipelines, offering an unprecedented balance between high performance and parameter efficiency. We open-source all our scripts, manifest files and model weights (see Appendix Section A.5).

2 mHuBERT-147 data collection

We gathered 90,430 hours of speech from datasets with permissive licences in 147 languages. For this multilingual collection, our goal was to prioritize linguistic diversity over data quantity alone. Table 1 lists the datasets we use along with their licences. Figure 1 illustrates data distribution across languages. In total, our training set spans 19 language families (sorted in decreasing order of data quantity): Indo-European, Niger-Congo, Uralic, Afro-Asiatic, Constructed (Esperanto), Turkic, Dravidian, Sino-Tibetan, Austronesian, Koreanic, Kra-Dai, Japonic, Language isolate (Basque), Kartvelian, Austroasiatic, Mongolic, Northwest Caucasian, Creole and Tupian. Appendix Tables 17-21 list all included languages with amount of data per dataset.

Section 2.1 details the data pre-processing and filtering process. Section 2.2 discusses the overlap between the data we use and the data used by popular multilingual SSL models for speech representation learning.

| Dataset | Full Names and References | # Languages | # Hours (filtered) | License |
|---|---|---|---|---|
| Aishells | Aishell [4] and AISHELL-3 [49] | 1 | 212 | Apache License 2.0 |
| B-TTS | BibleTTS [25] | 6 | 358 | CC BY-SA 4.0 |
| Clovacall | ClovaCall [21] | 1 | 38 | MIT |
| CV | Common Voice version 11.0 [1] | 98 | 14,943 | CC BY-SA 3.0 |
| G-TTS | High quality TTS data for Javanese, Khmer, Nepali, Sundanese, and Bengali Languages [50]; High quality TTS data for four South African languages [54] | 9 | 27 | CC BY-SA 4.0 |
| IISc-MILE | IISc-MILE Tamil and Kannada ASR Corpus [32, 33] | 2 | 406 | CC BY 2.0 |
| JVS | Japanese versatile speech [52] | 1 | 26 | CC BY-SA 4.0 |
| Kokoro | Kokoro Speech Dataset [24] | 1 | 60 | CC0 |
| kosp2e | Korean Speech to English Translation Corpus [11] | 1 | 191 | CC0 |
| MLS | Multilingual LibriSpeech [39] | 8 | 50,687 | CC BY 4.0 |
| MS | MediaSpeech [26] | 1 | 10 | CC BY 4.0 |
| Samrómur | Samrómur Unverified 22.07 [51] | 1 | 2,088 | CC BY 4.0 |
| TH-data | THCHS-30 [56] and THUYG-20 [45, 46] | 2 | 46 | Apache License 2.0 |
| VL | VoxLingua107 [53] | 107 | 5,844 | CC BY 4.0 |
| VP | VoxPopuli [55] | 23 | 15,494 | CC0 |
Table 1: Datasets used for training mHuBERT-147, with the abbreviation used throughout the paper (left), the number of languages and hours (after filtering), and the license. Download URLs are listed in Appendix Table 16.
Figure 1: Speech amount per language in a logarithmic scale, with different levels of speech resourcefulness: $\geq 800$ h (blue), $\geq 100$ h (green), $\geq 50$ h (orange), $\geq 10$ h (red), $\leq 10$ h (purple). Some languages are excluded from the plot. Best seen in color.

2.1 Dataset pre-processing and filtering

We follow the default HuBERT pre-processing guidelines [23]. For all datasets, the speech data is converted to 16-bit 16kHz WAV files, with a volume reduction of 5% applied for datasets in which excessive clipping is observed during conversion. Data is filtered to the interval of $[2, 30]$ s, and any utterances outside this window are discarded. The exceptions are the following datasets, where file concatenation is performed instead: B-TTS (Akwapen Twi, Asante Twi, Hausa, Yoruba), IISc-MILE (Kannada), and JVS (Japanese). In these cases, we performed concatenation due to the already limited amount of speech samples we had available for training in those languages.
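The duration filter described above can be sketched as follows. This is a minimal illustration that uses only the Python standard library; the helper names (`wav_duration_seconds`, `keep_utterance`, `filter_wavs`) are hypothetical and are not taken from the released pre-processing scripts.

```python
import wave

# Closed interval, in seconds, stated in the text.
MIN_SEC, MAX_SEC = 2.0, 30.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file from its header."""
    with wave.open(str(path), "rb") as handle:
        return handle.getnframes() / float(handle.getframerate())


def keep_utterance(duration_sec, min_sec=MIN_SEC, max_sec=MAX_SEC):
    """True when the utterance duration lies inside [min_sec, max_sec]."""
    return min_sec <= duration_sec <= max_sec


def filter_wavs(paths):
    """Hypothetical helper: keep only the WAV files whose duration is in range."""
    return [p for p in paths if keep_utterance(wav_duration_seconds(p))]
```

For the datasets listed as exceptions, short files would be concatenated before this check rather than discarded.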

For TTS/ASR/ST datasets, only the training set is used; validation and test sets are always excluded to prevent data contamination. The complete manifest list can be found in our HuggingFace repository (manifest files: https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest). We now detail dataset-specific pre-processing (scripts: https://github.com/utter-project/mHuBERT-147-scripts/tree/main/01_dataset_preproc).

  • VL: Manual inspection revealed that some language splits contained a significant amount of music, noise, and silence-only files. To improve the quality of the data used by our models, and to accurately estimate the amount of speech data per language, we filtered the dataset using the inaSpeechSegmenter tool [16]. In total, we removed over 332K potentially noisy utterances (249K music, 83K noise/silence). More details are given in Appendix Section C.2.

  • VP: For languages with ample representation in the multilingual collection (i.e., over 1,000h of speech - namely German, English, Spanish, French, and Dutch), we exclusively use the 10K subset of VP (2019 and 2020). For the other 18 languages, we utilize talks from 2017 to 2020 (100K VP split).
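As an illustration of the VL filtering step, the sketch below decides whether to keep a file from a list of labelled segments. It assumes the segmenter output is a list of `(label, start_sec, end_sec)` tuples with labels such as `speech`, `music`, `noise` and `noEnergy`, which is our understanding of what inaSpeechSegmenter returns. The 0.5 speech-ratio threshold and the function names are hypothetical; the actual filtering criterion is described in Appendix Section C.2.

```python
# Labels counted as speech; "male"/"female" cover gender-detection mode (assumption).
SPEECH_LABELS = {"speech", "male", "female"}


def speech_ratio(segments):
    """Fraction of the total segmented duration that is labelled as speech."""
    total = sum(end - start for _, start, end in segments)
    if total <= 0:
        return 0.0
    speech = sum(end - start for label, start, end in segments
                 if label in SPEECH_LABELS)
    return speech / total


def keep_file(segments, min_speech_ratio=0.5):
    """Hypothetical rule: keep a file only if enough of it is speech."""
    return speech_ratio(segments) >= min_speech_ratio


# Example: 8 s of music followed by 2 s of speech is rejected.
example = [("music", 0.0, 8.0), ("speech", 8.0, 10.0)]
decision = keep_file(example)
```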

2.2 Data overlap with existing multilingual models

Table 2 shows the overlap in the training data of mHuBERT-147 and the multilingual wav2vec 2.0 [3] models: XLSR-53 [13], XLS-R [2] and MMS [40]. Looking at the table, we see that the model most similar to ours in terms of dataset overlap is XLS-R, with only BABEL not included in our training. This dataset is not freely available, and it comprises speech data in 17 low-resource African and Asian languages: Assamese, Bengali, Cantonese, Cebuano, Georgian, Haitian, Kazakh, Lao, Northern Kurdish, Pashto, Swahili, Tagalog, Tamil, Tok Pisin, Turkish, Vietnamese and Zulu. Of these, the only language we did not find an openly available dataset for was Zulu. Thus, mHuBERT-147 covers 127 of the 128 languages of XLS-R, while supporting 20 additional languages.

Dataset # Langs XLSR-53 XLS-R MMS mHuBERT-147
BABEL [19] 17
CV v2 38
CV v6 60
CV v9 89
CV v11 98
MLS 8
VP 23 down-sampled
VL 107 down-sampled
MMS-lab-U [40] 1,362
Aishells, B-TTS, Clovacall, G-TTS, IIScMILE, JVS, Kokoro, kosp2e, MS, Samrómur, TH-data 23
Table 2: Datasets included in the training of the multilingual models most comparable to the mHuBERT-147 model. Although we treat earlier versions of CV as being entirely contained in later versions, this is a simplification: some files might have been removed due to speakers' requests or down-votes.

3 mHuBERT-147 training

Our training follows the multi-iteration pre-training objective of HuBERT [23]. This approach leverages an external acoustic unit discovery model (k-means clustering) to provide frame-level targets for masked span prediction pretraining. For optimization, two weighted cross-entropy losses are computed over masked ($L_m$) and unmasked ($L_u$) spans, defined as $L = \psi L_m + (1 - \psi) L_u$. $L_m$ and $L_u$ can both be defined using the generic loss function in Equation 1. Here, $f$ refers to the masked prediction model. $X' = [x_1, .., x_T]$ is the masked speech utterance of frame length $T$. $M \subset \{1, .., T\}$ are the masked indices in $X'$ for the masked loss ($L_m$), and the unmasked indices otherwise ($L_u$). $Z = [z_1, .., z_T]$ are the discrete labels from clustering. Finally, $P_f$ produces the distribution over the target indices at each time-step $t$.
The authors of HuBERT highlight that higher values of $\psi$ result in a model analogous to language modeling, where the prediction model $f$ is forced to learn both the unmasked segments and the long-range temporal structure of the speech data.

$$L(f; X', M, Z) = \sum_{t \in M} \log P_f(z_t \mid X', t) \tag{1}$$
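To make the objective concrete, the following sketch evaluates Equation 1 on toy per-frame logits and combines the masked and unmasked terms with $\psi$. It is a plain-Python illustration, not the fairseq implementation; the function names are hypothetical. Note that Equation 1 as written is a log-likelihood, so the corresponding cross-entropy loss to minimize is its negative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def span_log_likelihood(frame_logits, labels, indices):
    """Equation 1: sum over t in `indices` of log P_f(z_t | X', t)."""
    return sum(log_softmax(frame_logits[t])[labels[t]] for t in indices)


def combined_objective(frame_logits, labels, masked, psi):
    """psi * L_m + (1 - psi) * L_u, where L_u uses the non-masked frames."""
    masked = set(masked)
    unmasked = [t for t in range(len(labels)) if t not in masked]
    l_m = span_log_likelihood(frame_logits, labels, sorted(masked))
    l_u = span_log_likelihood(frame_logits, labels, unmasked)
    return psi * l_m + (1.0 - psi) * l_u


# Toy example: 3 frames, 4 clusters, uniform predictions, frames 0 and 1 masked.
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
toy_labels = [0, 1, 2]
only_masked = combined_objective(toy_logits, toy_labels, masked=[0, 1], psi=1.0)
```

With $\psi = 1$ only the two masked frames contribute, so `only_masked` equals $2 \log(1/4)$.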

Next, we dive into how mHuBERT-147 differs from HuBERT training, including a) a more complex data up-sampling strategy (3.1), and b) a replacement of the original k-means implementation with the efficient faiss [17] Inverted File Index (IVF) – drastically increasing labeling speed (3.2). Finally, we discuss our experimental setup (3.3).

3.1 Two-level language-source up-sampling

To optimize for exposure to different languages and data sources during training, we employ a two-level up-sampling strategy during multilingual batching. Let $N$ be the total number of examples in the training set, with $n_l$ corresponding to the count for language $l$. The sampling probability for $l$ is $P_l \propto \left(\frac{n_l}{N}\right)^{\alpha}$, where $\alpha$ is a hyper-parameter in $[0, 1]$; with $\alpha = 1$ resulting in no up-sampling, and lower values resulting in higher probabilities for under-represented languages.

For each epoch, we sample $N$ times from the probability distribution $P_l$. In this way we reach a quantity $B_l$ of examples selected per language $l$. We sample the $B_l$ utterances by considering varied data sources (datasets). The probability of sampling an utterance of $l$ from data source $x$ is given by $P_x \propto \left(\frac{n_l(x)}{n_l}\right)^{\beta}$, where $n_l(x)$ corresponds to the number of examples of language $l$ from data source $x$, and $\beta$ is a hyper-parameter in $[0, 1]$. We sort the $N$ selected utterances by length before batching, to minimize random cropping.
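The two sampling levels can be sketched as below. This is a simplified illustration under stated assumptions: utterances are drawn with replacement within each (language, source) pool, and the final length-sorting step is omitted. The function names and the nested-dictionary layout are hypothetical and do not mirror the fairseq fork.

```python
import random
from collections import Counter


def smoothed_probs(counts, exponent):
    """Normalise (n / total) ** exponent over the keys of `counts`."""
    total = sum(counts.values())
    weights = {key: (n / total) ** exponent for key, n in counts.items()}
    norm = sum(weights.values())
    return {key: w / norm for key, w in weights.items()}


def two_level_sample(examples, alpha, beta, rng):
    """examples: {language: {source: [utterance ids]}}; returns N sampled ids."""
    lang_counts = {lang: sum(len(utts) for utts in sources.values())
                   for lang, sources in examples.items()}
    n_total = sum(lang_counts.values())

    # Level 1: draw N languages from P_l, which yields B_l per language.
    p_lang = smoothed_probs(lang_counts, alpha)
    langs = list(p_lang)
    draws = rng.choices(langs, weights=[p_lang[l] for l in langs], k=n_total)
    per_language = Counter(draws)

    # Level 2: for each language, spread its B_l draws over sources using P_x.
    selected = []
    for lang, b_l in per_language.items():
        src_counts = {src: len(utts) for src, utts in examples[lang].items()}
        p_src = smoothed_probs(src_counts, beta)
        sources = list(p_src)
        picks = rng.choices(sources, weights=[p_src[s] for s in sources], k=b_l)
        for src in picks:
            selected.append(rng.choice(examples[lang][src]))
    return selected


toy = {
    "en": {"cv": [f"en-cv-{i}" for i in range(90)]},
    "sw": {"cv": [f"sw-cv-{i}" for i in range(8)],
           "vl": [f"sw-vl-{i}" for i in range(2)]},
}
sampled = two_level_sample(toy, alpha=0.7, beta=0.9, rng=random.Random(0))
```

With $\alpha = 0.7$, the language holding 10% of the toy data receives a sampling probability above 0.1, which is the intended up-sampling effect.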

The most similar up-sampling strategy to ours is MMS – which considers (language, dataset) pairs as unique languages for up-sampling. However, their approach inflates the training time allocated to high-resource languages, as these tend to have more diverse data sources in the training set. Unlike them, we adopt a two-level technique to better suit our diverse dataset collection.

3.2 Efficient clustering with faiss indices

We use faiss-based [17] k-means clustering for faster label assignment. This library facilitates efficient similarity search and clustering of dense vectors. We cluster using an Inverted File Index (IVF) with the following setting: OPQM_D,IVFK_HNSW32,PQMx4fsr, which we now detail (more information on the index factory can be found at: https://github.com/facebookresearch/faiss/wiki/The-index-factory).

  • “OPQM_D,..,PQMx4fsr” is used to reduce the RAM usage of the index. This option optimizes the vector representation by rotation and then projects the input vectors into dimension D. Product quantization (PQ) is then applied to hash the vectors into M 4-bit codes, resulting in the storage of M/2 bytes per vector. We use $M = 16$, $D = 64$.

  • “..,IVFK_HNSW32,..” denotes the indexing itself. IVF performs coarse quantization via an efficient implementation of k-means. This option uses Hierarchical Navigable Small Worlds (HNSW) graphs for cluster assignment, with 32 being the number of links per vertex. In practice, this greatly speeds up label assignment (5.2x faster than Hsu et al. [23]).
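The two options above combine into a single index-factory string. The sketch below assembles that string for given M, D and K and computes the per-vector code size; `build_and_train_index` shows how such a string could be passed to `faiss.index_factory`, assuming faiss and a float32 feature matrix are available. The helper names are hypothetical, and the released scripts may organize this differently.

```python
def index_factory_string(m, d, k, hnsw_links=32):
    """Assemble the 'OPQM_D,IVFK_HNSW32,PQMx4fsr' setting for concrete values."""
    return f"OPQ{m}_{d},IVF{k}_HNSW{hnsw_links},PQ{m}x4fsr"


def pq_bytes_per_vector(m, bits_per_code=4):
    """M codes of 4 bits each occupy M/2 bytes."""
    return m * bits_per_code // 8


def build_and_train_index(features, m=16, d=64, k=1000):
    """Train an index on `features`, a float32 array of shape (n, feat_dim).

    faiss is imported lazily so the helpers above work without it installed.
    """
    import faiss

    index = faiss.index_factory(features.shape[1], index_factory_string(m, d, k))
    index.train(features)
    return index


spec = index_factory_string(16, 64, 1000)
```

With M = 16 and D = 64 as stated above, and K = 1000 centroids, each stored vector takes 8 bytes of PQ codes.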

3.2.1 Faiss vs sklearn k-means

Our motivation for replacing the sklearn k-means with faiss comes from our struggle to scale the former to a multilingual setting with more data and a higher K. For instance, working with an 850 GB set of vectors of dimensionality 768, we were unable to finish the sklearn clustering step after four days, while the faiss implementation was able to produce an index in 40 h. Moreover, label application is 5.2 times faster using faiss indices than using sklearn clustering. This speed-up in label application is particularly relevant for our work, as we have a massive training set.

We verified that this new clustering setting does not impact downstream model quality by pretraining two baselines: a two-iteration English HuBERT following the original codebase, and a faiss-based equivalent model with the same number of centroids. Both pretrained models were then fine-tuned on the Librispeech [35] 100 h ASR, performing similarly on clean/other test sets. Appendix Section A.1 presents our detailed results and discussion.

3.3 Experimental setting

3.3.1 Faiss training

For clustering, we also sample data using the two-level up-sampling approach, to ensure alignment with the training data. We maximize the data quantity used for clustering by sampling with a RAM budget of 850 GB, which corresponds to over 6 M examples (20.8%) for iteration 1, and over 630 K examples (around 2%) for iterations 2 and 3. The reduced example count for later iterations is due to the features being high-dimensional ($dim = 768$) compared to the iteration 1 MFCCs ($dim = 39$). We use $K = 1000$ across all iterations. Code is available at: https://github.com/utter-project/mHuBERT-147-scripts/tree/main/03_faiss_indices/
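A rough check of why the example count drops by about a factor of ten between iterations is sketched below. The frame rates (100 frames per second for MFCCs, 50 for Transformer features) and float32 storage are assumptions on our part, not values stated above.

```python
def bytes_per_second_of_audio(feat_dim, frames_per_second, bytes_per_value=4):
    """Memory needed to hold the features for one second of audio."""
    return feat_dim * frames_per_second * bytes_per_value


# Assumed frame rates: 100 fps for MFCCs, 50 fps for Transformer features.
mfcc_cost = bytes_per_second_of_audio(39, 100)
hubert_cost = bytes_per_second_of_audio(768, 50)

# How many times fewer seconds of audio fit in the same RAM budget.
memory_ratio = hubert_cost / mfcc_cost
```

Under these assumptions `memory_ratio` is a little under 10, which is in line with the reported drop from over 6 M to over 630 K examples for the same 850 GB budget.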

3.3.2 Pre-training setting

We build our codebase on top of fairseq [34], introducing data loading optimizations and a new multilingual batching setting (Section 3.1); our fork is available at: https://github.com/utter-project/fairseq. We train HuBERT base models (95 M parameters) following the original recipe’s hyperparameters but with a larger batch size (2.8 M instead of 1.4 M frames), input normalization, and scaling to more updates: 2 M steps (approximately 31 epochs), instead of the original 400 K. For our two-level up-sampling strategy, we select $\alpha = 0.7$ (language up-sampling) and $\beta = 0.9$ (source up-sampling). Feature extraction for iterations 2 and 3 is performed using, respectively, the 6th and the 9th layer of the last checkpoint of the previous iteration. Training one iteration requires 32x A100-80GB GPUs running in 4 parallel nodes for approximately 20 days. More details regarding training are presented in Appendix Section A.4.

3.3.3 Hardware considerations

mHuBERT-147 is trained on over 90 K hours of speech in 147 languages – considerably less than its most comparable multilingual model XLS-R (128 languages, 436 K hours), and the larger-scale MMS (1,406 languages, 491 K hours). Indeed, data quantity is a major bottleneck for HuBERT, since the training process requires discrete labels for the entire training dataset which, in turn, are produced from high-dimensional features. Extracting these features in a typical multi-iteration process can incur prohibitive expenses for massive datasets, in terms of both storage requirements and pre-processing times. For instance, the features we extracted for mHuBERT-147 occupied 48 TB of storage per iteration. We estimate that for the training set of XLS-R, this would take about 270 TB. While storage requirements can be reduced by looping the feature extraction and labeling processes, this would increase pre-processing times. Appendix Section A.2 details hardware requirements for the different pre-processing and training steps.
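The order of magnitude of the per-iteration storage figure can be reproduced with a short calculation. The sketch assumes 768-dimensional float32 features at 50 frames per second, stored uncompressed for a single layer; these are our assumptions, and the actual on-disk format may differ.

```python
def feature_storage_terabytes(hours, feat_dim=768, frames_per_second=50,
                              bytes_per_value=4):
    """Uncompressed size, in decimal terabytes, of frame-level features."""
    n_frames = hours * 3600 * frames_per_second
    return n_frames * feat_dim * bytes_per_value / 1e12


# 90,430 hours is the size of the mHuBERT-147 training set.
estimate_tb = feature_storage_terabytes(90_430)
```

Under these assumptions the estimate comes to roughly 50 TB, close to the 48 TB reported above.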

4 ML-SUPERB evaluation

| SSL | # Params | Monolingual ASR, normal, CER (↓) | Multilingual ASR, normal, CER (↓) | Multilingual ASR, few-shot, CER (↓) | LID, normal, ACC (↑) | Multilingual ASR + LID, normal, ACC (↑) | Multilingual ASR + LID, normal, CER (↓) | Multilingual ASR + LID, few-shot, CER (↓) | SUPERB_s (↑) |
|---|---|---|---|---|---|---|---|---|---|
| MMS-1B | 965M | 33.3 / 25.7 | 21.3 / 18.1 | 30.2 / 30.8 | 84.8 / 86.1 | 73.3 / 74.8 | 26.0 / 25.5 | 25.4 / 24.8 | 983.5 (1st) / 948.1 (2nd) |
| NWHC1 | 317M | 39.5 / 30.5 | 28.9 / 21.5 | 41.4 / 38.6 | 67.1 / 87.4 | 77.1 / 90.6 | 28.8 / 21.5 | 40.3 / 38.2 | 774.4 / 876.9 (3rd) |
| NWHC2 | 317M | 39.5 / 30.5 | 29.3 / 21.6 | 42.0 / 39.3 | 64.4 / 88.1 | 77.4 / 90.6 | 28.4 / 21.8 | 41.5 / 38.8 | 759.9 / 873.3 |
| mHuBERT-147-3rd | 95M | 34.2 / 26.3 | 23.6 / 22.0 | 33.2 / 32.9 | 85.3 / 91.0 | 81.4 / 90.0 | 26.2 / 22.1 | 34.9 / 33.5 | 949.8 (2nd) / 950.2 (1st) |
| mHuBERT-147-2nd | 95M | 35.9 / 27.6 | 25.4 / 22.5 | 34.2 / 33.8 | 74.8 / 90.1 | 81.0 / 89.0 | 26.3 / 23.6 | 33.9 / 34.4 | 895.0 / 925.7 |
| MMS-300M | 317M | 33.8 / 30.5 | 28.7 / 24.0 | 36.5 / 36.5 | 62.3 / 84.3 | 71.9 / 74.3 | 31.5 / 30.0 | 30.9 / 29.2 | 824.9 (3rd) / 844.3 |
| XLS-R-300M | 317M | 39.7 / 30.6 | 29.2 / 22.0 | 40.9 / 39.3 | 66.9 / 87.9 | 55.6 / 85.6 | 28.4 / 22.9 | 42.1 / 42.4 | 730.8 / 850.5 |
| WavLabLM-large-MS | 317M | 40.5 / 32.8 | 37.8 / 31.9 | 43.8 / 42.8 | 71.7 / 81.1 | 70.8 / 80.0 | 37.0 / 32.2 | 43.4 / 41.2 | 707.5 / 740.9 |

Table 3: ML-SUPERB 10min/1h results. The current SOTA [48] is shown in the top portion of the table, our submission is shown in the middle, and other relevant multilingual models are presented below it. Updated SOTA scores for each metric are presented in bold. The first (1st), second (2nd) and third (3rd) best SUPERB scores are marked.

To evaluate the quality of the multilingual representations learned by mHuBERT-147, we use the ML-SUPERB benchmark [47]. This benchmark comprises two settings (10min and 1h) and four tasks: monolingual ASR, multilingual ASR, LID, and joint multilingual ASR and LID. The monolingual setting comprises 13 (language, domain) pairs, while the multilingual one has 240 pairs across 143 languages in total – of which 123 and 20 constitute the normal (10min/1h per language) and few-shot (5 utterances per language) evaluation settings respectively.

Setup.

The downstream architecture consists of learnable weights over the frozen SSL features, a CNN for reducing feature dimensionality by a half, and two Transformer layers ($dim = 256$; 8 attention heads). In line with the official guidelines (recipe at: https://github.com/espnet/espnet/tree/master/egs2/ml_superb/asr1), we only tune the learning rates (best of 1e-3, 1e-4, 1e-5) on the validation set; Appendix B.1 presents more information regarding optimization. Due to the high compute costs of evaluating on this benchmark, we do not retrain the other submissions we compare against and instead reuse the leaderboard scores from Shi et al. [48]. Therefore, we are unfortunately unable to provide confidence intervals, as these were not originally requested by the benchmark organizers. We also recompute the global SUPERB scores for all models following Shi et al. [47], since this metric leverages SOTA scores for normalization, and our model achieves new SOTA on three tasks (bold values in Table 3). Our SUPERB score calculator is available at: https://github.com/utter-project/mHuBERT-147-scripts/blob/main/06_evaluation/mlsuperb/compute_superb_score.py
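The first component of this downstream architecture, the learnable weights over frozen SSL layers, can be illustrated with a small plain-Python sketch. Normalizing the weights with a softmax is an assumption here (it is a common choice for such layer-weighting schemes), and the function names are hypothetical rather than taken from the ESPnet recipe.

```python
import math


def softmax(weights):
    """Turn unconstrained layer weights into a probability vector."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_features, layer_weights):
    """Combine per-layer features into one sequence.

    layer_features: list of layers, each a list of T frames of D floats.
    layer_weights: one learnable scalar per layer (softmax-normalised here).
    """
    probs = softmax(layer_weights)
    n_frames = len(layer_features[0])
    dim = len(layer_features[0][0])
    return [
        [sum(p * layer[t][d] for p, layer in zip(probs, layer_features))
         for d in range(dim)]
        for t in range(n_frames)
    ]


# Two layers, one frame, two dimensions; equal weights give the average.
layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
combined = weighted_layer_sum(layers, [0.0, 0.0])
```

Because only these scalar weights and the small CNN/Transformer head are trained, the comparison across SSL models does not depend on fine-tuning the backbone itself.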

Results.

Table 3 presents results for the 10min/1h settings. The current SOTA models from Shi et al. [48] are shown at the top (NWHC models are variants of MMS-300M). Other relevant multilingual SSL models are displayed in the bottom portion: MMS-300M, XLS-R-300M, and the best WavLabLM model from Chen et al. [10] (136 languages; 40 K hours). We highlight that although XLS-R and MMS are referred to in the literature as “300 M”, the correct parameter count is 317 M. We omit XLSR-53 results as these are consistently worse than XLS-R [47]. We present results for both our second (2nd) and third (3rd) mHuBERT-147 iterations.
Looking at the bottom rows, we notice that mHuBERT-147 outperforms the XLS-R and WavLabLM models across most tasks and settings starting from its 2nd iteration. Compared to MMS-300M, both the 2nd and 3rd iterations of our model are only surpassed in three cases: few-shot ASR+LID 10min/1h, and monolingual ASR 10min. These results highlight that the mHuBERT-147 training approach produces effective output starting from the 2nd iteration. They also motivate conducting a 3rd iteration: between mHuBERT-147-2nd and mHuBERT-147-3rd, scores improve on every metric except few-shot ASR+LID 10min.
Focusing on our best mHuBERT-147 model, and comparing it against the current SOTA (top rows), we see that this model is again very competitive, despite being trained on much less data. Our compact mHuBERT-147-3rd reaches the second position of the leaderboard for the 10min setting, where it is outperformed only by a model ten times larger (MMS-1B), and the first position for the 1h setting. We also set new SOTA ACC scores for LID 10min, LID 1h and Multilingual ASR+LID 10min, while being the only 95 M parameter model at the top of the leaderboard.

5 FLEURS-102 evaluation

We complement the ML-SUPERB evaluation by training monolingual ASR models on the FLEURS-102 dataset [14], competing with XLS-R and MMS (300M/1B). In this full fine-tuning few-shot setting, we simply add a linear projection to the target vocabulary on top of the pre-trained stack, optimizing the resulting model for ASR using approximately 10 hours of speech. Thus, unlike the experiments in the previous section, larger models will now have the benefit of having more parameters for adaptation. With these experiments, we want to illustrate that even in this unfavorable setting, mHuBERT-147 can still be competitive, while being faster at fine-tuning and inference time.

Setup.

We implement monolingual ASR using the transformers library [60]. All models were trained for 100 epochs in fp32 using V100-32GB/A100-80GB GPUs. Since individual language optimization would be prohibitively costly for this dataset, we select a subset of 29 languages, covering the different geographic groups, and optimize parameters using XLS-R. We use a learning rate of 1e-5, a warm-up ratio of 0.1, and dropout of 0.1 (300 M and 1 B) or 0.3 (95 M). The increased dropout for the latter is due to the ASR models being considerably smaller (70% fewer parameters). We train three models per language with different seeds, adding up to four runs in cases of high variability in scores ($\geq 20$ CER). We apply MMS transcript normalization [40], reporting CER averages over the two best runs.
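For reference, the reported metric can be sketched as follows: a character error rate based on Levenshtein distance, followed by the average over the two best runs. This is a generic illustration; it does not include the MMS transcript normalization, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def mean_of_best_runs(run_cers, n_best=2):
    """Average the n_best lowest CER values across seeds."""
    best = sorted(run_cers)[:n_best]
    return sum(best) / len(best)


# Three seeds for one language; the two lowest CERs are averaged.
reported = mean_of_best_runs([30.0, 10.0, 20.0])
```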

Results.

Table 4 presents results grouped by FLEURS-102 geographic groups: Western Europe (WE); Eastern Europe (EE); Central-Asia/Middle-East/North-Africa (CMN); Sub-Saharan Africa (SSA); South-Asia (SA); South-East Asia (SEA); Isolates (CJK). Despite its small size, mHuBERT-147 achieves the best average over 102 languages in this setting, mainly due to its superior performance in the CMN, SA, and SEA groups. Further investigation reveals this superior performance is due to its greater robustness to few-shot adaptation compared to other models. For instance, compared to MMS-1B (see Appendix Table 11), mHuBERT-147 consistently shows higher CER averages, likely due to its smaller capacity. However, MMS-1B fails to achieve useful transcription (CER $> 90$) for 16 languages, while mHuBERT-147 only fails for half of these. Indeed, when these 16 languages are removed from the computed average, MMS-1B produces the best FLEURS-102 scores (MMS-1B: 8.2, mHuBERT-147: 13.3). In other words, the overall superior performance of mHuBERT-147 is due to its representations converging more consistently in few-shot settings compared to other models. Appendix Section B.2 presents more information regarding the FLEURS-102 results.

SSL          General Avg (102)  WE (25)  EE (16)  CMN (12)  SSA (20)  SA (14)  SEA (11)  CJK (4)
MMS-1B       22.3               17.4     11.0     37.8      23.3      27.7     25.9      17.8
MMS-300M     24.9               19.5     12.5     39.8      24.8      29.1     29.5      35.8
XLS-R-300M   24.5               18.7     11.8     39.2      24.8      29.7     29.5      33.4
mHuBERT-147  21.1               18.7     15.3     23.4      22.7      25.5     19.8      31.5
Table 4: FLEURS-102 CER (\downarrow) geographic group averages, with number of languages between parentheses.
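As a consistency check on Table 4, the overall average can be recovered from the group averages by weighting each group by its number of languages. The sketch below copies the group sizes and scores from the table; the function name is illustrative, and because the group averages are themselves rounded, agreement is only expected to one decimal place.

```python
# Number of languages per geographic group, from the Table 4 header.
GROUP_SIZES = {"WE": 25, "EE": 16, "CMN": 12, "SSA": 20, "SA": 14, "SEA": 11, "CJK": 4}

# CER group averages copied from Table 4.
GROUP_CER = {
    "MMS-1B": {"WE": 17.4, "EE": 11.0, "CMN": 37.8, "SSA": 23.3,
               "SA": 27.7, "SEA": 25.9, "CJK": 17.8},
    "MMS-300M": {"WE": 19.5, "EE": 12.5, "CMN": 39.8, "SSA": 24.8,
                 "SA": 29.1, "SEA": 29.5, "CJK": 35.8},
    "XLS-R-300M": {"WE": 18.7, "EE": 11.8, "CMN": 39.2, "SSA": 24.8,
                   "SA": 29.7, "SEA": 29.5, "CJK": 33.4},
    "mHuBERT-147": {"WE": 18.7, "EE": 15.3, "CMN": 23.4, "SSA": 22.7,
                    "SA": 25.5, "SEA": 19.8, "CJK": 31.5},
}


def overall_average(group_cer, group_sizes=GROUP_SIZES):
    """Language-weighted mean of the group averages."""
    total_languages = sum(group_sizes.values())
    weighted_sum = sum(group_cer[group] * size for group, size in group_sizes.items())
    return weighted_sum / total_languages


for model, scores in GROUP_CER.items():
    print(f"{model}: {overall_average(scores):.1f}")
```

The weighting makes explicit why the CMN group matters so much: it has only 12 languages, but the gap there between mHuBERT-147 and the other models is roughly 15 CER points.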
Training and Inference efficiency.

We measured fine-tuning (1 epoch) and inference efficiency (test set) across three runs and two languages (English and Kannada) using an RTX 3090-24GB in exclusive node execution mode. We find that mHuBERT-147, which has 70% fewer parameters, is on average 1.8 and 3 times faster to fine-tune than the 300M and 1B models, respectively. For test inference, our model is on average 1.4 and 2 times faster than the 300M and 1B models, respectively.

Overall, our results highlight our model as a compact but powerful solution for multilingual speech processing applications. Appendix Section B.3 presents complementary results for the ESB benchmark.

6 Conclusion

In this paper, we presented the first general-purpose multilingual HuBERT model. For training, we prioritized data quality over quantity: we selected diverse data sources, filtered noisy data, and defined a two-level multilingual batching up-sampling approach that considers both language and dataset diversity. For clustering, we leveraged faiss-optimized clustering to decrease label assignment costs. Despite being a compact model trained with considerably less data than popular multilingual models, mHuBERT-147 is competitive with larger multilingual SSL models, reaching second and first positions on the ML-SUPERB 10min and 1h leaderboards, respectively, and setting new SOTA for three LID metrics. Complementary ASR results on FLEURS-102 suggest that mHuBERT-147 is a promising model for multilingual speech downstream tasks, offering an unprecedented balance between high performance and parameter efficiency.

7 Acknowledgments

We thank the ML-SUPERB authors for providing us with assistance and clarification during the evaluation of our models. We would also like to thank our NAVER LABS Europe colleagues Caroline Brun and Salah Aït-Mokhtar for their assistance with the FLEURS-102 experiments.

This work was funded by the European Union’s Horizon Europe (HE) Research and Innovation programme under Grant Agreement No 101070631 and by UK Research and Innovation (UKRI) under the UK government’s HE funding grant No 10039436. This document is licensed under CC-BY-4.0.


References

  • Ardila et al. [2020] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber. Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4211–4215, 2020.
  • Babu et al. [2022] Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. In Proc. Interspeech 2022, pages 2278–2282, 2022.
  • Baevski et al. [2020] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.
  • Bu et al. [2017] Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pages 1–5, 2017.
  • Carletta [2007] Jean Carletta. Unleashing the killer corpus: experiences in creating the multi-everything ami meeting corpus. Language Resources and Evaluation, 41:181–190, 2007.
  • Chang et al. [2023] Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, et al. Exploring speech recognition, translation, and understanding with discrete speech units: A comparative study. arXiv preprint arXiv:2309.15800, 2023.
  • Chen et al. [2021] Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909, 2021.
  • Chen et al. [2022] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, and Xiong Xiao. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022.
  • Chen et al. [2023a] William Chen, Xuankai Chang, Yifan Peng, Zhaoheng Ni, Soumi Maiti, and Shinji Watanabe. Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute. In Proc. INTERSPEECH 2023, pages 4404–4408, 2023a.
  • Chen et al. [2023b] William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, and Shinji Watanabe. Joint prediction and denoising for large-scale multilingual self-supervised learning. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE, 2023b.
  • Cho et al. [2021] Won Ik Cho, Seok Min Kim, Hyunchang Cho, and Nam Soo Kim. kosp2e: Korean Speech to English Translation Corpus. In Proc. Interspeech 2021, pages 3705–3709, 2021.
  • Chou et al. [2023] Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, and Michael Auli. Toward joint language modeling for speech units and text. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6582–6593, Singapore, 2023. Association for Computational Linguistics.
  • Conneau et al. [2021] Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. Unsupervised Cross-Lingual Representation Learning for Speech Recognition. In Proc. Interspeech 2021, pages 2426–2430, 2021.
  • Conneau et al. [2023] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. Fleurs: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE, 2023.
  • Del Rio et al. [2022] Miguel Del Rio, Peter Ha, Quinten McNamara, Corey Miller, and Shipra Chandra. Earnings-22: A practical benchmark for accents in the wild. arXiv preprint arXiv:2203.15591, 2022.
  • Doukhan et al. [2018] David Doukhan, Jean Carrive, Félicien Vallet, Anthony Larcher, and Sylvain Meignier. An open-source speaker gender detection framework for monitoring gender equality. In Acoustics Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018.
  • Douze et al. [2024] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. 2024.
  • Duquenne et al. [2023] Paul-Ambroise Duquenne, Hongyu Gong, Ning Dong, Jingfei Du, Ann Lee, Vedanuj Goswami, Changhan Wang, Juan Pino, Benoît Sagot, and Holger Schwenk. SpeechMatrix: A large-scale mined corpus of multilingual speech-to-speech translations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16251–16269, Toronto, Canada, 2023. Association for Computational Linguistics.
  • Gales et al. [2014] Mark JF Gales, Kate M Knill, Anton Ragni, and Shakti P Rath. Speech recognition and keyword spotting for low-resource languages: Babel project research at cued. In Fourth International workshop on spoken language technologies for under-resourced languages (SLTU-2014), pages 16–23. International Speech Communication Association (ISCA), 2014.
  • Gandhi et al. [2022] Sanchit Gandhi, Patrick Von Platen, and Alexander M Rush. Esb: A benchmark for multi-domain end-to-end speech recognition. arXiv preprint arXiv:2210.13352, 2022.
  • Ha et al. [2020] Jung-Woo Ha, Kihyun Nam, Jingu Kang, Sang-Woo Lee, Sohee Yang, Hyunhoon Jung, Hyeji Kim, Eunmi Kim, Soojin Kim, Hyun Ah Kim, Kyoungtae Doh, Chan Kyu Lee, Nako Sung, and Sunghun Kim. ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers. In Proc. Interspeech 2020, pages 409–413, 2020.
  • Hernandez et al. [2018] François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Esteve. Ted-lium 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer: 20th International Conference, SPECOM 2018, Leipzig, Germany, September 18–22, 2018, Proceedings 20, pages 198–208. Springer, 2018.
  • Hsu et al. [2021] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
  • Iida [2022] Katsuya Iida. Kokoro Speech Dataset. https://github.com/kaiidams/Kokoro-Speech-Dataset, 2022. [Online; accessed 21-Feb-2024].
  • Meyer et al. [2022] Josh Meyer, David Adelani, Edresson Casanova, Alp Öktem, Daniel Whitenack, Julian Weber, Salomon Kabongo Kabenamualu, Elizabeth Salesky, Iroro Orife, Colin Leong, Perez Ogayo, Chris Chinenye Emezue, Jonathan Mukiibi, Salomey Osei, Apelete AGBOLO, Victor Akinode, Bernard Opoku, Olanrewaju Samuel, Jesujoba Alabi, and Shamsuddeen Hassan Muhammad. BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus. In Proc. Interspeech 2022, pages 2383–2387, 2022.
  • Kolobov et al. [2021] Rostislav Kolobov, Olga Okhapkina, Andrey Platunov Olga Omelchishina, Roman Bedyakin, Vyacheslav Moshkin, Dmitry Menshikov, and Nikolay Mikhaylovskiy. Mediaspeech: Multilanguage asr benchmark and dataset, 2021.
  • Lagos and Calapodescu [2024] Nikolaos Lagos and Ioan Calapodescu. Unsupervised multi-domain data selection for asr fine-tuning. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 10711–10715, 2024.
  • Lee et al. [2022] Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, and Wei-Ning Hsu. Textless speech-to-speech translation on real data. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 860–872, Seattle, United States, 2022. Association for Computational Linguistics.
  • Li et al. [2022] Pengcheng Li, Genshun Wan, Fenglin Ding, Hang Chen, Jianqing Gao, Jia Pan, and Cong Liu. Improved speech pre-training with supervision-enhanced acoustic unit. arXiv preprint arXiv:2212.03482, 2022.
  • Lin et al. [2023] Tzu-Quan Lin, Hung-yi Lee, and Hao Tang. Melhubert: A simplified hubert on mel spectrograms. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE, 2023.
  • Liu et al. [2020] Andy T Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, and Hung-yi Lee. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6419–6423. IEEE, 2020.
  • Madhavaraj Ayyavu et al. [2022a] Madhavaraj Ayyavu, Bharathi Pilar, and A.G. Ramakrishnan. Subword dictionary learning and segmentation techniques for automatic speech recognition in tamil and kannada, 2022a.
  • Madhavaraj Ayyavu et al. [2022b] Madhavaraj Ayyavu, Bharathi Pilar, and A.G. Ramakrishnan. Knowledge-driven subword grammar modeling for automatic speech recognition in tamil and kannada, 2022b.
  • Ott et al. [2019] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
  • Panayotov et al. [2015a] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015a.
  • Panayotov et al. [2015b] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015b.
  • Parcollet et al. [2024] Titouan Parcollet, Ha Nguyen, Solène Evain, Marcely Zanon Boito, Adrien Pupier, Salima Mdhaffar, Hang Le, Sina Alisamir, Natalia Tomashenko, Marco Dinarelli, Shucong Zhang, Alexandre Allauzen, Maximin Coavoux, Yannick Estève, Mickael Rouvier, Jerôme Goulian, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, and Laurent Besacier. Lebenchmark 2.0: A standardized, replicable and enhanced framework for self-supervised representations of french speech. Computer Speech and Language, 86:101622, 2024.
  • Pascual et al. [2019] Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, and Yoshua Bengio. Learning problem-agnostic speech representations from multiple self-supervised tasks. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, pages 161–165, 2019.
  • Pratap et al. [2020] Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. MLS: A Large-Scale Multilingual Dataset for Speech Research. In Proc. Interspeech 2020, pages 2757–2761, 2020.
  • Pratap et al. [2023] Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al. Scaling speech technology to 1,000+ languages. arXiv preprint arXiv:2305.13516, 2023.
  • Ravanelli et al. [2020] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, João Monteiro, Jan Trmal, and Yoshua Bengio. Multi-task self-supervised learning for robust speech recognition. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6989–6993, 2020.
  • Ravanelli et al. [2021] Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, and Yoshua Bengio. SpeechBrain: A general-purpose speech toolkit, 2021. arXiv:2106.04624.
  • Renals et al. [2007] Steve Renals, Thomas Hain, and Hervé Bourlard. Recognition and understanding of meetings the ami and amida projects. In 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pages 238–247. IEEE, 2007.
  • Rivière et al. [2020] Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, and Emmanuel Dupoux. Unsupervised pretraining transfers well across languages. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7414–7418, 2020.
  • Roze et al. [2015] Askar Roze, Shi Yin, Zhiyong Zhang, Dong Wang, and Askar Hamdulla. Thugy20: A free uyghur speech database. In NCMMSC’15, 2015.
  • Rozi et al. [2015] Askar Rozi, Dong Wang, and Zhiyong Zhang. An open/free database and benchmark for uyghur speaker recognition. In O-COCOSDA’15, 2015.
  • Shi et al. [2023a] Jiatong Shi, Dan Berrebbi, William Chen, En-Pei Hu, Wei-Ping Huang, Ho-Lam Chung, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, and Shinji Watanabe. ML-SUPERB: Multilingual Speech Universal PERformance Benchmark. In Proc. INTERSPEECH 2023, pages 884–888, 2023a.
  • Shi et al. [2023b] Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, et al. Findings of the 2023 ml-superb challenge: Pre-training and evaluation over more languages and beyond. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE, 2023b.
  • Shi et al. [2015] Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. Aishell-3: A multi-speaker mandarin tts corpus and the baselines. 2015.
  • Sodimana et al. [2018] Keshan Sodimana, Knot Pipatsrisawat, Linne Ha, Martin Jansche, Oddur Kjartansson, Pasindu De Silva, and Supheakmungkol Sarin. A Step-by-Step Process for Building TTS Voices Using Open Source Data and Framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese. In Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU), pages 66–70, Gurugram, India, 2018.
  • Hedström et al. [2022] Staffan Hedström, Ragnheiður Þórhallsdóttir, David Erik Mollberg, Smári Freyr Guðmundsson, Ólafur Helgi Jónsson, Sunneva Þorsteinsdóttir, Judy Y. Fong, Eydís Huld Magnúsdóttir, and Jon Gudnason. Samrómur Unverified 22.07. Reykjavik University: Language and Voice Lab, 2022.
  • Takamichi et al. [2019] Shinnosuke Takamichi, Kentaro Mitsui, Yuki Saito, Tomoki Koriyama, Naoko Tanji, and Hiroshi Saruwatari. Jvs corpus: free japanese multi-speaker voice corpus. arXiv preprint arXiv:1908.06248, 2019.
  • Valk and Alumäe [2021] Jörgen Valk and Tanel Alumäe. VoxLingua107: a dataset for spoken language recognition. In Proc. IEEE SLT Workshop, 2021.
  • van Niekerk et al. [2017] Daniel van Niekerk, Charl van Heerden, Marelie Davel, Neil Kleynhans, Oddur Kjartansson, Martin Jansche, and Linne Ha. Rapid development of TTS corpora for four South African languages. In Proc. Interspeech 2017, pages 2178–2182, Stockholm, Sweden, 2017.
  • Wang et al. [2021] Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 993–1003, Online, 2021. Association for Computational Linguistics.
  • Wang et al. [2015] Dong Wang, Xuewei Zhang, and Zhiyong Zhang. Thchs-30 : A free chinese speech corpus. 2015.
  • Wang et al. [2023] Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, and Furu Wei. Viola: Unified codec language models for speech recognition, synthesis, and translation. arXiv preprint arXiv:2305.16107, 2023.
  • Watanabe et al. [2018] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. ESPnet: End-to-end speech processing toolkit. In Proceedings of Interspeech, pages 2207–2211, 2018.
  • Yang et al. [2021] Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee. SUPERB: Speech Processing Universal PERformance Benchmark. In Proc. Interspeech 2021, pages 1194–1198, 2021.
  • Wolf et al. [2020] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, 2020. Association for Computational Linguistics.

Appendix A Pre-training experiments

A.1 Faiss-based clustering

To evaluate our hypothesis that the mini-batch k-means proposed in the original HuBERT recipe can be replaced with the more efficient faiss IVF clustering (see Section 3.2), we trained two HuBERT base models [23] from scratch: one with k-means clustering and the other with faiss indices. We follow the recipe available in the fairseq library [34] for training the HuBERT models. (Footnote 11: Available at: https://github.com/facebookresearch/fairseq/blob/main/examples/hubert/README.md) The models are trained for two iterations on the LibriSpeech dataset [36]. We use the same number of centroids as the original recipe: K = 100 in the first iteration, and K = 500 in the second. (Footnote 12: Model hyper-parameters are available at: https://github.com/facebookresearch/fairseq/blob/main/examples/hubert/config/pretrain/hubert_base_librispeech.yaml)

Clustering performance.

We find that, in this setting, both approaches take the same time to train their labels when using the LibriSpeech train split with K = 500 in the second iteration (2.33 hours). However, as mentioned previously (Section 3.2.1), after four days, we failed to finish training a sklearn-based k-means using more data (850 GB) and higher K (K = 1000). For faiss, the same setting took approximately 40 hours.

Labeling performance.

Clustering application using sklearn takes 1.8 min per 10 h of speech, whereas it takes faiss only 21 s to label the same amount of speech data. This makes the former approach 5.2 times slower. The speed up in the latter is mainly due to the Hierarchical Navigable Small Worlds (HNSW) optimized graph search algorithm.
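The slowdown factor follows directly from the two per-10 h timings quoted above. The sketch below only redoes that arithmetic; the constant and function names are illustrative. With the rounded figures given in the text the ratio comes out at about 5.1, which agrees with the reported 5.2 if that value was derived from unrounded timings (an assumption on our part).

```python
# Reported labeling cost per 10 h of speech, converted to seconds.
SKLEARN_SECONDS_PER_10H = 1.8 * 60  # 1.8 min
FAISS_SECONDS_PER_10H = 21.0        # 21 s


def slowdown(slow_seconds, fast_seconds):
    """How many times slower the first method is than the second."""
    return slow_seconds / fast_seconds


ratio = slowdown(SKLEARN_SECONDS_PER_10H, FAISS_SECONDS_PER_10H)
print(f"sklearn labeling is about {ratio:.1f}x slower than faiss")
```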

LibriSpeech Results.

We fine-tune models on the 100 h split from LibriSpeech, using the Hugging Face transformers library [60]. We train for 30 epochs in fp16 precision, adding a linear projection to the target vocabulary on top of the pre-trained stacks. We explore two fine-tuning approaches: all (wherein all weights are updated) or weighted (wherein only downstream layers and a weighted sum over the pre-trained representations are updated). We tune the learning rate for each model (best of 1e-2; 1e-3; 1e-4; 1e-5), selecting the best result using valid-clean. We train models twice, varying the random seed, and present average results. Table 5 presents results for the original HuBERT model (original; available at: https://huggingface.co/facebook/hubert-base-ls960), our reproduction of the same model (sklearn), and the faiss-based clustering version (faiss).
We observe that both models trained from scratch (sklearn and faiss) present a slight decrease in performance compared to the original pre-trained model. We believe this could be due to differences in data selection for the clustering steps, which is performed at random using 20% of the training data. Our models are also trained using the random crop batching approach, while the original implementation might have used padding. (Footnote 14: The shared recipe code mentions that “v1” HuBERT used padding, but we could not find information in the paper regarding this.) Finally, focusing on models trained under the same settings, we observe that performance for sklearn and faiss models is very similar. This suggests that faiss IVFs can be leveraged for HuBERT training without performance degradation. In Appendix Section A.4.2 we illustrate that in multilingual settings, and by increasing K for faiss, we also produce equally performing speech representation models by using either sklearn or faiss clustering.

Model     FT        test-clean (\downarrow)  test-other (\downarrow)
original  all       7.1                      15.3
sklearn   all       8.4                      17.1
faiss     all       7.9                      17.1
original  weighted  48.8                     57.1
sklearn   weighted  50.5                     59.0
faiss     weighted  50.2                     59.2
Table 5: Average WER test results for the Librispeech dataset using different English 2nd iteration HuBERT models.

A.2 Hardware considerations

In this section, we detail the different hardware requirements for training our mHuBERT-147 model. Table 6 summarizes the key information. The next few paragraphs elaborate on these costs.

   Task                           RAM/job    CPU/job                               GPU/job        Total Disk  Total Processing Time
0  Speech                         -          -                                     -              9.6 TB      -
1  Speech Features iteration 1    -          -                                     -              4.7 TB      -
1  Speech Features iteration 2-3  50-500 GB  -                                     1x V100-32 GB  48 TB       3-5 days
2  faiss Index Training           2 TB       32x Intel(R) Xeon(R) Platinum 8452Y   -              3.6 GB      2 days
3  faiss Index Application        50-500 GB  4-8 cores (any)                       -              37 GB       2-3 days
4  Manifest/Labels                -          -                                     -              65 GB       -
5  Model training (iteration)     2 TB/node  64 x AMD EPYC 7313 16-Core Processor  32x A100-80GB  35 GB       20 days
Table 6: Overview of the hardware requirements per job (RAM; CPU; GPU), total disk required to store the output of the tasks, and total estimated processing time. The numbers on the left illustrate the hierarchy of processes, with larger numbers depending on the completion of the previous processes.
Data storage requirements.

Storing mHuBERT-147’s 90 K hours of speech requires 9.6 TB of disk storage. For the first iteration, an additional 4.7 TB is required for storing the extracted MFCC features for clustering and label application. For the second and third iterations, substantially more storage is needed, as the 39-dimensional features are replaced by 768-dimensional vectors. As a result, our storage requirements increased roughly 10 times, reaching about 48 TB. Fortunately, the storage of feature vectors is not a requirement for training, and so in our pre-processing protocol, we performed iterative extraction for labeling, avoiding having more than 10 TB of feature vectors at a given moment on our clusters. Thus, we can consider that the upper bound of storage required to train an mHuBERT-147 model with 90 K hours is around 58 TB, and we can estimate that training a model with the same amount of speech data as XLS-R could necessitate up to 280 TB of storage space. While one can cut storage costs during pre-processing, this comes with an increase in pre-processing time. The iteration between extraction and labeling, followed by feature deletion, might also extend the duration of GPU idle time between consecutive iterations.
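The storage figures in this paragraph can be reproduced with a few lines of arithmetic. The sketch below is an illustration only: the function names are hypothetical, and the extrapolation to XLS-R assumes that storage grows linearly with hours of speech, using the 436 K hours quoted for XLS-R in the abstract.

```python
SPEECH_TB = 9.6             # raw speech for 90 K hours (reported)
MFCC_TB = 4.7               # iteration-1 MFCC features (reported)
HIDDEN_FEATURES_TB = 48.0   # iteration 2-3 768-dimensional features (reported)
CORPUS_HOURS = 90_000
XLSR_HOURS = 436_000        # XLS-R training hours, as quoted in the abstract


def upper_bound_tb(speech_tb, features_tb):
    """Worst case: raw speech and all feature vectors stored at the same time."""
    return speech_tb + features_tb


def scale_to_hours(storage_tb, source_hours, target_hours):
    """Assumes storage grows linearly with the amount of speech."""
    return storage_tb * target_hours / source_hours


growth = HIDDEN_FEATURES_TB / MFCC_TB
bound = upper_bound_tb(SPEECH_TB, HIDDEN_FEATURES_TB)
xlsr_estimate = scale_to_hours(bound, CORPUS_HOURS, XLSR_HOURS)

print(f"feature storage growth: {growth:.1f}x")
print(f"upper bound for 90 K hours: {bound:.1f} TB")
print(f"extrapolation to XLS-R-scale data: {xlsr_estimate:.0f} TB")
```

The iterative extract-label-delete protocol described above trades this worst-case footprint for extra pre-processing time, keeping the feature store under 10 TB at any moment.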

GPU requirements for pre-processing.

GPUs are required to extract features for the entire corpus, as these are needed to produce the target labels. The cost of this extraction can be greatly reduced with data sharding. On average, extracting feature vectors for 1 h of speech requires 9 s on a V100-32GB. Considering 12 parallel jobs running on V100-32GB GPUs, with optimal sharding 90 K hours of speech data could be processed in approximately 19 h. However, this is an over-optimistic lower bound, as sharding will never be optimal unless it is performed cross-dataset (mixing data from different datasets during the extraction process), which makes data management a more complex task. Moreover, this also requires the upper bound of storage requirements discussed above. For mHuBERT-147, we shard feature extraction only for datasets containing over 20,000 utterances for a given language, so as to allow us to more effectively parallelize the labeling task. This results in an approximate extraction time of 3 to 5 days for the entire mHuBERT-147 training corpus.
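The 19 h lower bound can be checked from the per-hour extraction cost and the number of parallel jobs. The following sketch assumes perfectly balanced shards, which the paragraph above notes is not achievable in practice; the names are illustrative.

```python
SECONDS_PER_SPEECH_HOUR = 9.0  # extraction cost on a V100-32GB (reported)
CORPUS_HOURS = 90_000
PARALLEL_JOBS = 12


def extraction_wall_clock_hours(corpus_hours, seconds_per_hour, jobs):
    """Ideal wall-clock time when the work is split evenly across jobs."""
    total_gpu_seconds = corpus_hours * seconds_per_hour
    return total_gpu_seconds / jobs / 3600


ideal = extraction_wall_clock_hours(CORPUS_HOURS, SECONDS_PER_SPEECH_HOUR, PARALLEL_JOBS)
print(f"ideal extraction time: {ideal:.2f} h")
```

The gap between this ideal figure and the observed 3 to 5 days reflects uneven shard sizes across datasets and languages.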

Clustering requirements.

In the original HuBERT paper, this step is performed by clustering with mini-batch k-means. In Section 3.2 we discuss replacing this approach with faiss indices, which are much more scalable. We sample training data using the same distribution used for up-sampling during multilingual training (Section 3.1), with a fixed RAM target of around 850 GB, which allows us to train indices on 2 TB RAM machines. (Footnote 15: The faiss index creation implementation requires almost twice the amount of RAM used for loading the data in order to train its indices.) Index training using the “OPQ16_64,IVF1000_HNSW32,PQ16x4fsr” option takes approximately 40 h running in exclusive mode on a node with 32 Intel(R) Xeon(R) Platinum 8452Y and 2 TB RAM. Index application (i.e. labeling of speech feature vectors) takes 21 s per 10 h of speech transformed into numpy feature vectors, costing about 53.65 processing hours for the totality of the training data. Index application is a RAM-bound process, where loading of the numpy features occupies the majority of the processing time, and thus, with increased sharding and parallel processing, index application takes negligible processing time.
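The RAM budget and the labeling cost above can be sanity-checked as follows. This is a rough sketch under stated assumptions: the overhead factor of 2 is taken as an upper bound on "almost twice", 2 TB is read as 2000 GB, and the corpus is taken as a nominal 90 K hours. Under these assumptions the labeling cost comes to about 52.5 h, slightly below the reported 53.65 h, which presumably reflects the exact corpus size and measured timings.

```python
DATA_RAM_GB = 850            # RAM target for the sampled training data (reported)
FAISS_OVERHEAD_FACTOR = 2.0  # "almost twice", treated here as an upper bound
NODE_RAM_GB = 2000           # 2 TB node, decimal units assumed


def fits_in_ram(data_gb, overhead_factor, node_gb):
    """Check that data plus faiss training overhead stays within node RAM."""
    return data_gb * overhead_factor <= node_gb


def labeling_hours(corpus_hours, seconds_per_10h=21.0):
    """Sequential index-application time for the whole corpus, in hours."""
    return corpus_hours / 10 * seconds_per_10h / 3600


print(fits_in_ram(DATA_RAM_GB, FAISS_OVERHEAD_FACTOR, NODE_RAM_GB))
print(f"labeling cost for a nominal 90 K hours: {labeling_hours(90_000):.1f} h")
```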

Training requirements.

Training was performed using the NAVER Cloud platform (https://www.ncloud.com/). We use 32x A100-80 GB GPUs running on 4 parallel nodes. Regarding RAM requirements, considering 8 parallel data loaders, manifest lists and label offsets stored in memory, 1.5 TB of RAM is needed. While keeping labels directly in RAM would increase batching speed, this would require an additional 2 TB of RAM. Storage needs are limited to the speech data (9.6 TB), corresponding manifest and label files (65 GB) and checkpoints (1.5 GB per checkpoint).

A.3 Loss curves and training instability

The loss curves for the three mHuBERT-147 iterations are presented in Figure 2. We observe that while the loss decrease from the 1st to the 2nd iteration is very pronounced, it seems to be more limited between the 2nd and the 3rd iterations. We hypothesize that the model might be starting to saturate at the end of the 3rd iteration, and might need extra capacity for improving further.

Figure 2: Loss curves for the 3 iterations of mHuBERT-147 training. For the 3rd iteration, the step jump from 1.8M to 0 is due to the optimizer re-initialization: this model crashed in fp16 and had to be reinitialized using fp32 for the last 200 K updates. Best seen in color.

In contrast to our previous experience training wav2vec 2.0 models, we find that our mHuBERT-147 models tend to be more stable during training. We do, however, reach training instability at the end of the 3rd iteration. To solve gradient explosion, similarly to wav2vec 2.0 (see discussion in Parcollet et al. [37]), we find that the most effective solution is increasing floating point precision and restarting the optimizer from the previous learning rate.

A.4 Pre-training hyper-parameters

In this section, we cover the different hyper-parameters chosen for the training of mHuBERT-147, comparing intermediate checkpoints obtained using different settings. As a proxy evaluation of the quality of the speech representations obtained, we use the 1 h phoneme recognition task from Rivière et al. [44]. This setting allows us to investigate the emergence of multilingual phonemic content on our models through different iterations and training updates.

Proxy evaluation.

We train monolingual few-shot phoneme recognition systems using the 10 available languages: Spanish (es), French (fr), Italian (it), Kyrgyz (ky), Dutch (nl), Russian (ru), Swedish (sv), Turkish (tr), Tatar (tt), Chinese (zh-HK). We follow the standard data split, using 1 h for training, validation and test. Our models are trained using the Speechbrain library [42], with a standard CTC fine-tuning approach employing two different learning rates: 1e-5 for the pre-trained SSL model, and 1e-4 for the preceding 2 linear layers and output projection. We use a dropout of 0.6 and train for 100 epochs. (Footnote 17: Recipe at: https://github.com/utter-project/mHuBERT-147-scripts/tree/main/06_evaluation/CV_PR.)
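For readers unfamiliar with the metric, phoneme error rate (PER) is conventionally the edit distance between predicted and reference phoneme sequences, normalized by the number of reference phonemes. The sketch below shows that standard definition; it is not taken from the evaluation recipe, whose exact scoring may differ, and the phoneme sequences are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance counting substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Corpus-level PER in percent: total edits over total reference phonemes."""
    edits = sum(edit_distance(ref, hyp) for ref, hyp in zip(refs, hyps))
    ref_length = sum(len(ref) for ref in refs)
    return 100.0 * edits / ref_length


# Hypothetical reference and predicted phoneme sequences.
refs = [["k", "a", "t"], ["d", "o", "g", "z"]]
hyps = [["k", "a", "t"], ["d", "o", "k"]]
print(f"PER: {phoneme_error_rate(refs, hyps):.2f}%")
```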

Figure 3: Linear interpolation of average PER performance of mHuBERT-147 across different iterations, and with different number of updates. The bold scores correspond to scores obtained at the following updates, from left to right: 400 K, 1 M, 1.5 M and 2 M.
| | Average PER ($\downarrow$) | es | fr | it | ky | nl | ru | sv | tr | tt | zh-HK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| sklearn ($K=500$) | 14.8 | 10.8 | 13.7 | 15.0 | 11.5 | 15.8 | 16.6 | 19.2 | 13.2 | 9.5 | 22.3 |
| faiss ($K=1000$) | 14.6 | 10.7 | 13.6 | 14.8 | 10.9 | 15.5 | 16.4 | 19.3 | 13.2 | 9.4 | 22.5 |
| random crop | 14.4 | 10.4 | 12.9 | 14.1 | 10.9 | 14.5 | 16.2 | 19.3 | 13.1 | 9.4 | 22.8 |
| padding | 23.6 | 18.0 | 23.7 | 22.9 | 20.1 | 25.4 | 25.0 | 29.7 | 23.0 | 17.1 | 30.9 |

Table 7: PER scores for the proxy task. Models on top and bottom portions are not comparable due to different up-sampling settings and maximum number of updates (top: 800 K, bottom: 2 M). Top portion: 1st iteration mHuBERT-147 models ($\alpha=0.5$; $\beta=0.8$) trained for 800 K updates using different clustering implementations: sklearn mini-batch k-means ($K=500$), or faiss IVF ($K=1000$). Bottom portion: 1st iteration faiss-based mHuBERT-147 models ($\alpha=0.7$; $\beta=0.9$) trained for 400 K updates using random crop or padding as batching approach.
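For readers unfamiliar with the metric, the sketch below shows how a phoneme error rate of the kind reported in Table 7 is commonly computed: the edit distance between reference and hypothesis phoneme sequences, divided by the number of reference phonemes. This is a generic illustration, not the scoring code used for our experiments (which relies on SpeechBrain). The function names and the corpus-level aggregation are assumptions made for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Corpus-level PER in percent: total edits / total reference phonemes.

    Assumption: errors are pooled over the corpus before dividing.
    """
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * edits / total


# Toy example: one deleted phoneme over six reference phonemes.
refs = [["k", "a", "t"], ["d", "o", "g"]]
hyps = [["k", "a", "t"], ["d", "o"]]
per = phoneme_error_rate(refs, hyps)
```

The same routine yields a character error rate (CER) when the sequences are characters rather than phonemes.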

A.4.1 Impact of number of updates and iterations

We train each iteration for 2 M updates, which is considerably more than the standard approach (between 100 K and 500 K updates [23, 28]). In this section, we present evidence that pre-training longer is necessary for improved multilingual representation learning.

Figure 3 presents mHuBERT-147 proxy performance across different iterations, and for a varying number of updates. (Footnote 18: Linear interpolation is used to build our plot, as evaluation across updates did not always fetch checkpoints corresponding to the exact same number of updates. Nonetheless, we highlight in bold the points at which checkpoints were evaluated across all iterations.) Results are average PER scores computed across ten languages. We highlight performance at 400 K, 1 M, 1.5 M and 2 M.
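The linear interpolation used to align checkpoints across iterations can be sketched as follows. This is an illustrative reimplementation, not our plotting code; the function name is hypothetical and the two checkpoint values in the example are placeholders, not measured scores.

```python
from bisect import bisect_right


def interpolate_per(checkpoints, update):
    """Linearly interpolate PER at `update`.

    `checkpoints` is a list of (update, PER) pairs sorted by update.
    """
    xs = [u for u, _ in checkpoints]
    ys = [p for _, p in checkpoints]
    if not xs[0] <= update <= xs[-1]:
        raise ValueError("update outside the evaluated range")
    i = bisect_right(xs, update)
    if i == len(xs):  # update coincides with the last checkpoint
        return ys[-1]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (update - x0) / (x1 - x0)


# Placeholder values: checkpoints evaluated at 350 K and 450 K updates.
example = [(350_000, 12.0), (450_000, 11.0)]
per_at_400k = interpolate_per(example, 400_000)
```

With such a helper, iterations whose checkpoints were saved at slightly different steps can be compared at a common update count.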

Despite the loss curve looking stable after 400 K (Figure 2), we observe that the models keep improving internal multilingual phonemic representation until the very end of the 2 M updates. This is true for all iterations but the 3rd, where we achieve for the first time a best score that does not correspond to the last checkpoint: 8.1 PER at update 1.975 M. We believe this is probably an artifact of the multilingual batching not favoring the specific subset of languages present in the proxy evaluation during the very last training batches. This can also be a result linked to model saturation.

Focusing on proxy results for the 1st and 2nd iterations, we observe that after 400 K updates, the 2nd iteration model produces scores similar to that at the end of the 1st iteration. A longer training allows this model to yield an overall performance improvement of 21.6% over the best score for the 1st iteration. Regarding the 3rd iteration, we observe that the proxy task improvement is not as significant compared to the 2nd iteration: the 3rd iteration model only surpasses the best 2nd iteration scores after 1.5 M updates. This hints at the saturation of the multi-iteration HuBERT training approach at the 3rd step. We believe that more capacity (i.e. going from base to large architecture) could allow the model to improve more substantially at this point.

Finally, we note that although this proxy task allows us to assess the model’s potential during training, its scores are not completely correlated to the final model’s downstream performance, mostly due to the limitation in covered languages (only 10). For instance, although the 3rd iteration seems to improve only 0.5 PER over the second iteration, we observe significant improvement of ML-SUPERB scores when using the 3rd over the 2nd iteration model (Section 4).

A.4.2 K-means versus faiss on multilingual settings

We train two 1st iteration mHuBERT-147 models for 800 K updates, varying the K-means implementation between sklearn ($K=500$) and faiss ($K=1000$). We select $K=500$ for sklearn because, as mentioned in Section 3.2.1, we were unable to train sklearn K-means models for our dataset using larger K values. Comparing models with different K values could result in a more favorable setting for faiss, but we believe this comparison can still be made using 1st iteration models. This is because at this step, clustering is performed using MFCCs, which are low-dimensional. Thus, the extra expressivity that faiss-based clustering has at this iteration should not have a significant impact on the quality of the produced discretization.

The top portion of Table 7 presents the results using our proxy evaluation setting for the two pre-trained 1st iteration models. (Footnote 19: We highlight that these scores are not comparable to the ones in Section A.4.1, as for this experiment, the up-sampling is more aggressive: $\alpha=0.5$; $\beta=0.8$.) We once again observe that faiss-based mHuBERT-147 training does not result in inferior speech representation modeling.

A.4.3 Padding versus random crop

We investigate the impact of random crop during multilingual training, compared to the more classical approach of padding. We train a first iteration padded mHuBERT-147 model for 400 K updates, and compare it to the model using random crop at the same training state (same number of seen examples). The bottom portion of Table 7 presents results for our proxy task. We observe that speech representation learning with the random crop batching mechanism drastically outperforms the padding approach at a low number of updates (400 K). We believe this is due to the higher example diversity in the random crop setting.
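To make the two batching strategies concrete, here is a minimal sketch of both collation modes on plain Python lists. It is a simplification written for illustration: the function names are hypothetical, and details of the actual training code (such as a cap on the maximum sample size) are deliberately left out.

```python
import random


def collate_random_crop(waves, rng):
    """Crop every waveform to the shortest length, at a random offset."""
    target = min(len(w) for w in waves)
    batch = []
    for w in waves:
        start = rng.randint(0, len(w) - target)  # inclusive bounds
        batch.append(w[start:start + target])
    return batch


def collate_padding(waves, pad_value=0.0):
    """Pad every waveform with `pad_value` up to the longest length."""
    target = max(len(w) for w in waves)
    return [w + [pad_value] * (target - len(w)) for w in waves]


waves = [[0.1] * 5, [0.2] * 8, [0.3] * 3]
cropped = collate_random_crop(waves, random.Random(0))
padded = collate_padding(waves)
```

With random crop, a long utterance contributes a different window each time it is drawn, which is one plausible reading of the higher example diversity mentioned above.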

A.4.4 Two-level language-source up-sampling

The hyper-parameters for the two-level language-source up-sampling were chosen considering the language-source distribution in our data collection. The following are our observations regarding these hyper-parameters.

Language up-sampling.

Aggressive language up-sampling will result in a better multilingual balance in terms of language exposure, but it can result in poor utterance diversity during a given epoch. This issue comes from the fact that low-resource languages have a limited number of examples, and thus aggressive up-sampling will allocate a large chunk of the batches to the same examples. While repeating examples in low-resource languages is the goal of up-sampling, we find that over-exposure to the same limited number of examples will result in overall multilingual speech representation degradation. For this reason, during our hyper-parameter search, we moved from $\alpha=0.5$ (41% of a given epoch allocated to repeated examples) to $\alpha=0.7$ (20% of a given epoch allocated to repeated examples).

Source up-sampling.

We find that this parameter is more important for high-resource languages, as low-resource languages are often single-sourced in our multilingual dataset. As we filtered overly large datasets for high-resource languages prior to training (Section 2), we employ a mild source up-sampling.
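The effect of the two exponents can be illustrated with a small sketch. It assumes the common exponentiated-frequency form, where a language is drawn with probability proportional to its hours raised to $\alpha$, and a source within that language with probability proportional to its hours raised to $\beta$. This form is an assumption made for the example and may differ in detail from the definition in the main text. The function name is hypothetical, and the hours are a three-language excerpt taken from the data tables in Appendix C.

```python
def upsampling_probs(hours, alpha, beta):
    """Return {(lang, source): probability} under two-level sampling.

    ASSUMPTION: p(lang) is proportional to (hours of lang) ** alpha and
    p(source | lang) is proportional to (hours of source) ** beta.
    """
    lang_weight = {l: sum(s.values()) ** alpha for l, s in hours.items()}
    z_lang = sum(lang_weight.values())
    probs = {}
    for lang, sources in hours.items():
        src_weight = {s: h ** beta for s, h in sources.items()}
        z_src = sum(src_weight.values())
        for src, w in src_weight.items():
            probs[(lang, src)] = (lang_weight[lang] / z_lang) * (w / z_src)
    return probs


# Hours for three languages only (excerpt, not the full collection).
hours = {
    "eng": {"CV": 2213.9, "MLS": 44659.7, "VL": 43.5, "VP": 293.6},
    "cat": {"CV": 1638.7, "VL": 80.3},
    "ceb": {"VL": 3.9},
}

aggressive = upsampling_probs(hours, alpha=0.5, beta=0.8)
milder = upsampling_probs(hours, alpha=0.7, beta=0.9)
```

Under this assumption, lowering $\alpha$ shifts probability mass toward low-resource languages such as Cebuano, which is what makes the up-sampling more aggressive and increases the share of repeated examples.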

A.5 Open-source links

The mHuBERT-147 models are made available under the CC-BY-NC-4.0 license. Our model release includes the faiss trained IVFs, the fairseq checkpoint and the corresponding HuggingFace checkpoint.

Appendix B Downstream experiments

B.1 ML-SUPERB experiments

For evaluation on ML-SUPERB, we follow the recipe provided in the ESPnet library [58]. We use their Hugging Face interface for fine-tuning our mHuBERT-147 model. (Footnote 20: More information available at: https://github.com/espnet/espnet/tree/master/egs2/ml_superb/asr1) We do implement an ML-SUPERB score calculator, since we could not find one provided by the organizers. (Footnote 21: Available at: https://github.com/utter-project/mHuBERT-147-scripts/blob/main/06_evaluation/mlsuperb/compute_superb_score.py)

Learning Rate.

Following the official ML-SUPERB guidelines, we only optimize the learning rate parameter (best of 1e-3, 1e-4 and 1e-5). Table 8 presents the list of learning rates selected by evaluating models on the validation set. The scores presented in Table 3 correspond to models trained with the corresponding learning rate.
We highlight that our experimentation showed that it is quite important to evaluate models with multiple learning rates. We show test LID results in Table 9 for mHuBERT-147-3rd trained using different learning rates. By varying the learning rate, we go from mediocre LID results to SOTA scores for this ML-SUPERB task.

Monolingual ASR Task.

Table 10 presents test CER results per language for mHuBERT-147-3rd, in both 10 min and 1 h settings. For each language, we train three models, varying the learning rate. We use the development set to select the final model for our submission (see Table 8). During training, we observed many instabilities for this particular setting, with models sometimes failing to converge. In these cases, we find that restarting training with a new random seed fixes the problem. We highlight that seeds vary across languages.

| | Monolingual ASR | Multilingual ASR | LID | Multilingual ASR+LID |
|---|---|---|---|---|
| 10 min | 1e-3 (GROUP A); 1e-4 (GROUP B) | 1e-3 | 1e-4 | 1e-3 |
| 1 h | 1e-3 | 1e-3 | 1e-4 | 1e-4 |

Table 8: Selected learning rates for mHuBERT-147-3rd using validation scores for the corresponding tasks. For monolingual ASR, GROUP A corresponds to: eng1, eng2, eng3, deu1, rus, swa, jpn, cmn, and GROUP B corresponds to: fra1, fra2, deu2, swe, xty.
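| Setting | Learning rate | ACC ($\uparrow$) |
|---|---|---|
| 10 min | 1e-3 | 49.68 |
| 10 min | 1e-4 | 85.33 |
| 10 min | 1e-5 | 69.56 |
| 1 h | 1e-3 | 51.06 |
| 1 h | 1e-4 | 90.95 |
| 1 h | 1e-5 | 81.35 |

Table 9: ML-SUPERB 10 min and 1 h LID ACC test scores for mHuBERT-147-3rd by varying the learning rate for training.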
| | Final Score | eng1 | eng2 | eng3 | fra1 | fra2 | deu1 | deu2 | rus | swa | swe | jpn | cmn | xty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10min | 34.2 | 30.9 | 42.3 | 37.0 | 43.2 | 51.0 | 25.2 | 36.9 | 26.4 | 24.7 | 29.9 | 13.9 | 34.7 | 63.0 |
| 1h | 26.3 | 22.6 | 33.6 | 27.3 | 29.5 | 36.1 | 20.2 | 24.3 | 20.5 | 19.8 | 21.6 | 10.4 | 23.8 | 57.9 |

Table 10: Detailed mHuBERT-147-3rd test CER ($\downarrow$) results for the languages inside the ML-SUPERB monolingual track. The final score is obtained by averaging all runs belonging to the same language, and then averaging this score across all different languages.
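The two-level averaging described in the caption of Table 10 can be reproduced with a few lines. The sketch below is illustrative and is not our released score calculator; the function name is hypothetical, and languages are grouped by stripping the trailing digit from the run name (so that eng1, eng2 and eng3 are averaged together first).

```python
from collections import defaultdict


def monolingual_final_score(cer_by_run):
    """Average runs within each language, then average across languages."""
    by_lang = defaultdict(list)
    for run, cer in cer_by_run.items():
        by_lang[run.rstrip("0123456789")].append(cer)
    lang_means = [sum(v) / len(v) for v in by_lang.values()]
    return sum(lang_means) / len(lang_means)


# Per-run test CER values copied from Table 10.
cer_10min = {"eng1": 30.9, "eng2": 42.3, "eng3": 37.0, "fra1": 43.2,
             "fra2": 51.0, "deu1": 25.2, "deu2": 36.9, "rus": 26.4,
             "swa": 24.7, "swe": 29.9, "jpn": 13.9, "cmn": 34.7, "xty": 63.0}
cer_1h = {"eng1": 22.6, "eng2": 33.6, "eng3": 27.3, "fra1": 29.5,
          "fra2": 36.1, "deu1": 20.2, "deu2": 24.3, "rus": 20.5,
          "swa": 19.8, "swe": 21.6, "jpn": 10.4, "cmn": 23.8, "xty": 57.9}

score_10min = round(monolingual_final_score(cer_10min), 1)  # 34.2
score_1h = round(monolingual_final_score(cer_1h), 1)        # 26.3
```

Note that a flat average over the 13 runs would give a different number, since English, French and German would then weigh more than the single-run languages.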

B.2 FLEURS-102 experiments

Table 11 presents average CER results for all languages in FLEURS-102.

Failed ASR models.

We find that MMS and XLS-R models fail to converge to useful transcription (CER $> 90$) for 16 languages (ara, ceb, ckb, cym, fas, ful, heb, hrv, hye, mri, nya, oci, snd, som, tam, tel). For mHuBERT-147, this only happens for half of these languages (cym, fas, hye, nya, oci, snd, som, tel). There is no instance where mHuBERT-147 fails to converge while the other models converge successfully.
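As a sanity check, the failure criterion can be applied directly to the mHuBERT-147 averages reported in Table 11 for the 16 languages listed above. The snippet is a minimal sketch; the function and constant names are hypothetical.

```python
FAIL_THRESHOLD = 90.0  # a model is considered failed when CER > 90

# mHuBERT-147 average CER from Table 11, restricted to the 16 languages
# for which MMS and XLS-R models fail to converge.
mhubert_cer = {
    "ara": 9.8, "ceb": 57.0, "ckb": 58.5, "cym": 99.3,
    "fas": 100.0, "ful": 19.1, "heb": 22.5, "hrv": 8.5,
    "hye": 100.0, "mri": 11.5, "nya": 99.3, "oci": 99.4,
    "snd": 99.1, "som": 100.0, "tam": 16.3, "tel": 99.3,
}


def failed_languages(cer_by_lang, threshold=FAIL_THRESHOLD):
    """Return the sorted language codes whose CER exceeds the threshold."""
    return sorted(l for l, c in cer_by_lang.items() if c > threshold)


failed = failed_languages(mhubert_cer)
```

Applied to these values, the filter returns the eight languages named in the text.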

Seen versus unseen languages.

We use the model closest to ours in language coverage (XLS-R) to compare model performance for seen and unseen languages. Table 12 presents the results. We observe that while being much more compact, mHuBERT-147 presents overall better performance in both seen and unseen languages.

| | afr | amh | ara | asm | ast | aze | bel | ben | bos | bul | cat | ceb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMS-1B | 9.5±0.02 | 7.9±0.13 | 99.1±0.0 | 9.8±0.02 | 5.7±0.13 | 6.0±0.01 | 4.4±0.1 | 7.0±0.07 | 5.0±0.03 | 4.2±0.02 | 4.5±0.01 | 86.8±6.43 |
| MMS-300M | 12.5±0.09 | 9.4±0.13 | 98.7±0.37 | 12.4±0.08 | 7.0±0.01 | 7.8±0.03 | 5.8±0.0 | 8.0±0.1 | 6.5±0.08 | 5.6±0.05 | 6.1±0.08 | 99.3±0.0 |
| XLS-R-300M | 12.0±0.73 | 10.6±0.31 | 99.1±0.0 | 12.6±0.07 | 6.4±0.02 | 8.0±0.15 | 5.6±0.04 | 8.4±0.0 | 5.7±0.09 | 4.8±0.18 | 5.2±0.04 | 99.3±0.0 |
| mHuBERT-147 | 15.3±0.74 | 12.1±0.21 | 9.8±0.16 | 15.1±0.18 | 10.2±0.24 | 11.3±0.33 | 8.0±0.11 | 9.6±0.13 | 9.3±0.13 | 8.5±0.05 | 8.4±0.02 | 57.0±43.03 |

| | ces | ckb | cmn | cym | dan | deu | ell | eng | est | fas | fil | fin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMS-1B | 4.6±0.07 | 97.4±1.82 | 17.9±0.58 | 99.3±0.0 | 9.1±0.11 | 3.9±0.02 | 6.1±0.35 | 6.4±0.26 | 3.5±0.04 | 99.6±0.0 | 4.4±0.11 | 2.9±0.06 |
| MMS-300M | 6.8±0.04 | 98.0±1.24 | 30.1±0.07 | 99.3±0.0 | 12.4±0.02 | 6.0±0.1 | 8.3±0.2 | 8.8±0.4 | 4.4±0.15 | 99.6±0.0 | 5.4±0.14 | 3.7±0.02 |
| XLS-R-300M | 5.7±0.08 | 94.9±4.32 | 27.9±1.63 | 99.3±0.0 | 11.1±0.22 | 5.0±0.04 | 7.5±0.24 | 7.8±0.11 | 3.9±0.16 | 96.0±3.59 | 5.4±0.13 | 3.3±0.04 |
| mHuBERT-147 | 11.1±0.1 | 58.5±41.52 | 30.7±1.15 | 99.3±0.0 | 17.6±0.09 | 9.6±0.14 | 13.5±0.04 | 11.0±0.14 | 6.4±0.04 | 100.0±0.0 | 7.2±0.2 | 5.4±0.01 |

| | fra | ful | gle | glg | guj | hau | heb | hin | hrv | hun | hye | ibo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMS-1B | 7.0±0.1 | 98.4±0.86 | 24.3±0.1 | 3.8±0.09 | 6.7±0.24 | 8.2±0.06 | 99.5±0.12 | 7.2±0.08 | 99.4±0.0 | 5.8±0.02 | 99.3±0.0 | 13.1±0.02 |
| MMS-300M | 10.5±0.16 | 103.8±4.62 | 29.2±0.08 | 5.1±0.0 | 8.0±0.02 | 9.2±0.01 | 109.4±10.43 | 9.3±0.06 | 99.4±0.0 | 9.2±0.09 | 98.6±0.67 | 14.7±0.11 |
| XLS-R-300M | 9.0±0.21 | 94.8±3.23 | 29.1±0.34 | 4.6±0.03 | 8.5±0.12 | 10.1±0.11 | 107.5±0.6 | 10.3±0.15 | 95.8±3.62 | 8.7±0.22 | 96.2±3.13 | 15.0±0.2 |
| mHuBERT-147 | 16.1±0.08 | 19.1±0.03 | 32.7±0.08 | 8.3±0.05 | 10.4±0.16 | 12.0±0.19 | 22.5±0.02 | 12.5±0.25 | 8.5±0.05 | 14.1±0.23 | 100.0±0.0 | 17.1±0.03 |

| | ind | isl | ita | jav | jpn | kam | kan | kat | kaz | kea | khm | kir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMS-1B | 3.9±0.09 | 9.4±0.01 | 2.1±0.01 | 5.4±0.23 | 22.2±0.26 | 12.5±0.06 | 5.8±0.01 | 5.6±0.06 | 3.3±0.08 | 4.9±0.02 | 16.4±0.07 | 4.6±0.14 |
| MMS-300M | 4.8±0.1 | 15.5±0.07 | 2.8±0.04 | 6.8±0.15 | 33.3±0.29 | 13.8±0.39 | 7.3±0.12 | 8.0±0.08 | 4.1±0.01 | 5.8±0.03 | 22.0±0.14 | 5.8±0.15 |
| XLS-R-300M | 4.8±0.21 | 16.0±0.25 | 2.4±0.01 | 7.1±0.19 | 34.9±0.17 | 15.0±0.15 | 7.8±0.04 | 7.9±0.03 | 4.0±0.16 | 5.3±0.03 | 22.5±0.85 | 6.1±0.05 |
| mHuBERT-147 | 6.8±0.15 | 14.9±0.27 | 4.8±0.06 | 9.6±0.09 | 38.4±0.73 | 16.3±0.08 | 8.5±0.12 | 10.3±0.19 | 5.6±0.17 | 7.8±0.01 | 24.5±0.11 | 7.5±0.08 |

| | kor | lao | lav | lin | lit | ltz | lug | luo | mal | mar | mkd | mlt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMS-1B | 15.0±0.14 | 24.9±0.67 | 3.4±0.13 | 5.1±0.09 | 5.0±0.01 | 8.9±0.22 | 9.5±0.14 | 6.3±0.05 | 5.3±0.19 | 8.7±0.14 | 2.6±0.01 | 4.9±0.06 |
| MMS-300M | 21.7±0.5 | 29.5±0.22 | 4.7±0.09 | 5.9±0.12 | 7.1±0.03 | 11.1±0.11 | 10.2±0.01 | 7.3±0.07 | 6.3±0.05 | 11.2±0.07 | 3.4±0.08 | 6.2±0.04 |
| XLS-R-300M | 24.6±1.37 | 30.4±0.15 | 4.1±0.08 | 6.0±0.13 | 6.1±0.01 | 10.7±0.01 | 10.6±0.02 | 8.1±0.12 | 6.6±0.14 | 12.2±0.02 | 3.2±0.02 | 5.4±0.01 |
| mHuBERT-147 | 20.7±0.22 | 33.5±0.17 | 7.6±0.1 | 7.6±0.06 | 10.2±0.07 | 14.9±0.03 | 11.8±0.15 | 9.3±0.22 | 7.6±0.05 | 14.1±0.15 | 5.1±0.18 | 8.7±0.13 |

| | mon | mri | msa | mya | nep | nld | nob | nso | nya | oci | ori | orm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMS-1B | 8.6±0.23 | 99.3±0.0 | 4.6±0.02 | 15.7±0.48 | 10.0±0.25 | 5.3±0.12 | 5.2±0.04 | 8.0±0.13 | 99.3±0.0 | 96.3±2.79 | 11.3±0.12 | 16.8±0.37 |
| MMS-300M | 11.2±0.04 | 99.7±0.35 | 6.7±0.05 | 20.6±0.2 | 12.4±0.34 | 8.0±0.16 | 6.5±0.05 | 9.7±0.11 | 99.3±0.0 | 99.4±0.0 | 16.2±0.09 | 19.6±0.17 |
| XLS-R-300M | 11.0±0.32 | 96.7±0.24 | 7.0±0.09 | 20.2±0.11 | 13.2±0.47 | 6.6±0.14 | 6.0±0.13 | 10.3±0.21 | 99.3±0.0 | 99.4±0.0 | 18.4±0.05 | 20.0±0.47 |
| mHuBERT-147 | 14.0±0.43 | 11.5±0.01 | 9.9±0.16 | 22.3±0.12 | 15.4±0.19 | 12.2±0.26 | 9.5±0.16 | 12.5±0.07 | 99.3±0.0 | 99.4±0.0 | 19.6±0.1 | 21.1±0.3 |

| | pan | pol | por | pus | ron | rus | slk | slv | sna | snd | som | spa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMS-1B | 9.1±0.03 | 4.2±0.01 | 4.4±0.08 | 18.3±0.12 | 5.0±0.03 | 4.8±0.01 | 3.5±0.07 | 6.2±0.0 | 5.0±0.08 | 99.1±0.0 | 99.3±0.0 | 2.9±0.08 |
| MMS-300M | 11.4±0.14 | 6.0±0.01 | 5.8±0.02 | 21.0±0.23 | 7.2±0.02 | 6.5±0.07 | 4.7±0.1 | 8.2±0.19 | 5.9±0.02 | 96.1±3.05 | 97.7±1.69 | 3.8±0.03 |
| XLS-R-300M | 12.4±0.15 | 5.0±0.05 | 4.9±0.04 | 21.5±0.51 | 6.3±0.05 | 6.0±0.15 | 4.0±0.07 | 7.2±0.11 | 6.5±0.03 | 99.1±0.0 | 96.7±2.66 | 3.3±0.05 |
| mHuBERT-147 | 14.5±0.07 | 10.2±0.18 | 9.0±0.26 | 24.0±0.29 | 10.9±0.0 | 9.2±0.08 | 7.6±0.07 | 11.1±0.02 | 8.5±0.19 | 99.1±0.0 | 100.0±0.0 | 6.1±0.08 |

| | srp | swa | swe | tam | tel | tgk | tha | tur | ukr | umb | urd | uzb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMS-1B | 14.2±0.11 | 4.2±0.05 | 7.0±0.23 | 99.3±0.0 | 99.3±0.0 | 5.1±0.04 | 12.6±0.11 | 4.3±0.16 | 4.9±0.01 | 16.6±0.27 | 8.8±0.11 | 7.7±0.04 |
| MMS-300M | 16.9±0.17 | 5.5±0.06 | 10.0±0.06 | 98.0±1.32 | 99.3±0.0 | 5.9±0.05 | 16.1±0.17 | 5.8±0.09 | 6.8±0.0 | 18.9±0.41 | 10.9±0.05 | 10.3±0.0 |
| XLS-R-300M | 16.5±0.66 | 5.6±0.1 | 9.1±0.01 | 92.9±6.42 | 99.3±0.0 | 6.2±0.18 | 16.0±0.19 | 5.8±0.09 | 6.1±0.08 | 20.0±0.64 | 13.6±0.15 | 10.7±0.12 |
| mHuBERT-147 | 18.8±0.15 | 6.9±0.04 | 15.2±0.11 | 16.3±0.22 | 99.3±0.0 | 7.0±0.08 | 17.7±0.12 | 7.4±0.07 | 9.6±0.0 | 23.4±0.12 | 15.1±0.36 | 13.8±0.27 |

| | vie | wol | xho | yor | yue | zul |
|---|---|---|---|---|---|---|
| MMS-1B | 10.9±0.11 | 14.7±0.1 | 6.7±0.09 | 18.2±0.22 | 16.1±0.01 | 6.6±0.03 |
| MMS-300M | 14.0±0.36 | 16.1±0.29 | 7.6±0.02 | 21.1±0.17 | 58.1±19.39 | 8.3±0.15 |
| XLS-R-300M | 15.6±0.31 | 16.2±0.4 | 8.0±0.07 | 22.1±0.5 | 46.3±5.03 | 8.5±0.0 |
| mHuBERT-147 | 17.3±0.14 | 17.8±0.01 | 9.8±0.05 | 23.9±0.08 | 36.1±1.55 | 11.1±0.01 |

Table 11: FLEURS-102 CER ($\downarrow$) average results per language with standard deviations. Best scores in bold.
| | General Avg (102) | Seen (87) | Unseen (15) |
|---|---|---|---|
| XLS-R-300M | 24.5 | 23.8 | 28.3 |
| mHuBERT-147 | 21.1 | 20.8 | 22.6 |

Table 12: FLEURS-102 CER ($\downarrow$) average scores for seen and unseen languages.
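The general average in Table 12 is consistent with a language-count-weighted combination of the seen and unseen averages, as the short check below illustrates. The helper name is hypothetical, and the check works from the rounded values in the table, so it only verifies agreement to one decimal place.

```python
def overall_from_groups(seen_avg, unseen_avg, n_seen=87, n_unseen=15):
    """Combine two group averages, weighting by the number of languages."""
    total = n_seen + n_unseen
    return (n_seen * seen_avg + n_unseen * unseen_avg) / total


xlsr_overall = round(overall_from_groups(23.8, 28.3), 1)     # 24.5
mhubert_overall = round(overall_from_groups(20.8, 22.6), 1)  # 21.1
```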

B.3 Monolingual versus multilingual HuBERT

In this section we compare the English speech representation from our best mHuBERT-147 model against a model of the same capacity, but trained on English-only data [23]. (Footnote 22: Available at: https://huggingface.co/facebook/hubert-base-ls960) With these experiments, we want to illustrate that mHuBERT-147 is not only competitive against larger multilingual speech representation models (see Sections 4 and 5), but also competitive with monolingual solutions that do not share their parameters across languages.

Setup.

We train ASR models for open datasets from the ESB benchmark [20]. We use the data selection method from Lagos and Calapodescu [27] to downsample training data to 50 h for faster training with marginal performance impact. We also adopt their experimental setup, which we now detail.
We train English ASR models using either mHuBERT-147 or the original HuBERT-base (hubert-base-ls960) as speech encoder, and follow the SpeechBrain library [42] recipe for ASR, detailed in Appendix Section A.4 (proxy evaluation). In these experiments too, we train ASR models using two distinct optimizers, one for the upstream model, and another for the three subsequent neural networks that precede the CTC output projection layer ($lr=0.9$). Table 13 presents the hyper-parameters we use per dataset.

Results.

Table 14 presents WER results for both models across all datasets. The average WER for the ASR models using HuBERT-base and mHuBERT-147 is respectively 23.1 and 23.6. Unsurprisingly, we find that fine-tuning a pre-trained model trained solely for the target language (HuBERT-base) yields better results for most datasets (6 out of 8, the exceptions being CV and Earnings-22). Nonetheless, we observe that mHuBERT-147, despite being trained on multilingual settings and having to share its parameters across many languages, still performs competitively with the English model. This model is on average only 0.5 WER points worse than HuBERT-base. We believe that the competitiveness of mHuBERT-147 compared to this monolingual model highlights its potential as a compact yet powerful backbone for multilingual speech applications.

| Dataset | Warm-up steps | Upstream learning rate | Dropout | Epochs |
|---|---|---|---|---|
| AMI [5, 43] | 200 | 5e-5 | 0.3 | 30 |
| CV [1] | 100 | 5e-5 | 0 | 30 |
| Earnings-22 [15] | 200 | 5e-5 | 0 | 30 |
| GigaSpeech [7] | 200 | 1e-5 | 0 | 40 |
| LibriSpeech [36] | 100 | 1e-5 | 0.15 | 40 |
| TED-LIUM [22] | 200 | 5e-5 | 0 | 40 |
| VP [55] | 100 | 5e-5 | 0 | 35 |

Table 13: Hyper-parameters for the English ASR models following the setup from Lagos and Calapodescu [27].
| Dataset | HuBERT-base | mHuBERT-147 |
|---|---|---|
| AMI | 33.1 | 33.6 (+0.5) |
| CV | 37.2 | 35.4 (-1.8) |
| Earnings-22 | 35.3 | 34.7 (-0.6) |
| GigaSpeech | 25.9 | 26.3 (+0.4) |
| LibriSpeech-clean | 7.7 | 9.7 (+2.0) |
| LibriSpeech-other | 13.6 | 17.3 (+3.7) |
| TED-LIUM | 11.7 | 13.1 (+1.4) |
| VP | 18.7 | 19.0 (+0.3) |
| Average WER ($\downarrow$) | 23.1 | 23.6 (+0.5) |

Table 14: WER ($\downarrow$) scores for English ASR systems trained on the different datasets from the ESB Benchmark. Following Lagos and Calapodescu [27], only 50 h of training data are used. The score difference between mHuBERT-147 and HuBERT-base scores is presented between parentheses.

Appendix C Data

C.1 Validation Set

The validation set is the same for all our experiments. It was created by sampling 5 utterances per (language, source) pair from the training set, resulting in 1,275 examples. These examples are not removed from the training set.
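The sampling procedure can be sketched as below. At 5 utterances per pair, 1,275 examples correspond to 255 (language, source) pairs, assuming every pair contributes exactly 5 utterances. The function name, the manifest layout (tuples of utterance id, language and source) and the fixed seed are assumptions made for this illustration, not a description of our released scripts.

```python
import random
from collections import defaultdict


def sample_validation(manifest, per_pair=5, seed=0):
    """Sample up to `per_pair` utterances for each (language, source) pair.

    `manifest` is an iterable of (utt_id, language, source) tuples.
    """
    groups = defaultdict(list)
    for utt_id, lang, src in manifest:
        groups[(lang, src)].append(utt_id)
    rng = random.Random(seed)
    selected = []
    for pair in sorted(groups):  # sorted for reproducibility
        utts = groups[pair]
        selected.extend(rng.sample(utts, min(per_pair, len(utts))))
    return selected


# Toy manifest: three (language, source) pairs with ten utterances each.
toy_manifest = [(f"{lang}-{src}-{i}", lang, src)
                for lang, src in [("abk", "CV"), ("abk", "VL"), ("afr", "VL")]
                for i in range(10)]
validation_ids = sample_validation(toy_manifest)
```

Since the sampled utterances stay in the training set, this validation set monitors optimization rather than generalization to unseen audio.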

C.2 VoxLingua107 Dataset Filtering

Manual inspection revealed that some language splits of this dataset contained a considerable amount of music-, noise- and silence-only files. These occurrences were more frequent in less-resourced languages. In order to increase the quality of the data we feed into our self-supervised models, and to more accurately estimate the amount of speech data per language they learn from, we filtered this dataset by performing music, noise and silence detection using the inaSpeechSegmenter tool [16].

Using this tool’s default settings, and knowing that the average utterance length for this dataset is 9 s, we classify a file as music if a music event is detected for longer than 2 s. Similarly, a file is classified as noise if there is a noise event for longer than 2 s, or if a non-energy event (i.e. silence) is detected for longer than 5 s. These settings were optimized using Cebuano, a language for which we observed a significant amount of noisy utterances. In total, we removed over 332K potentially noisy utterances (249K music, 83K noise/silence). We make both our filtering script (Footnote 23: Scripts available at: https://github.com/utter-project/mHuBERT-147-scripts/tree/main/01_dataset_preproc/dataset_specific_preproc/voxlingua107) and the list of retained VL files available. (Footnote 24: Manifest files available at: https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest)
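The decision rule above can be written as a small function over a list of detected segments. This is a sketch under stated assumptions rather than our released filtering script: it takes segments as (label, start, end) tuples in seconds, assumes the label names music, noise and noEnergy, applies each threshold to the longest single event of that type, and checks music before noise. All names are hypothetical.

```python
MUSIC_MAX_S = 2.0    # music event longer than 2 s -> music
NOISE_MAX_S = 2.0    # noise event longer than 2 s -> noise
SILENCE_MAX_S = 5.0  # non-energy event longer than 5 s -> noise


def classify_file(segments):
    """Return 'music', 'noise' or 'keep' for one audio file.

    `segments` is a list of (label, start, end) tuples in seconds.
    ASSUMPTION: thresholds apply to the longest single event per label.
    """
    longest = {}
    for label, start, end in segments:
        longest[label] = max(longest.get(label, 0.0), end - start)
    if longest.get("music", 0.0) > MUSIC_MAX_S:
        return "music"
    if (longest.get("noise", 0.0) > NOISE_MAX_S
            or longest.get("noEnergy", 0.0) > SILENCE_MAX_S):
        return "noise"
    return "keep"


clean = classify_file([("speech", 0.0, 8.0), ("noEnergy", 8.0, 9.0)])
musical = classify_file([("music", 0.0, 3.5), ("speech", 3.5, 9.0)])
silent = classify_file([("noEnergy", 0.0, 6.0), ("speech", 6.0, 9.0)])
```

If the thresholds were instead meant to apply to the total duration per label, the `max` would simply be replaced by a running sum.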

Table 15 presents the 107 languages of this dataset sorted into buckets corresponding to the percentage of noisy input detected by the automatic tool. We notice that for most languages, this percentage is limited: 77 languages have noise below 15%. However, for some low-resource languages, the amount of identified noisy utterances can go up to 80%. We take these automatic filtering results cautiously – as they might not be truly representative of the amount of noise present in all of these languages. However, we do believe they illustrate an existing quality issue of this popular corpus for some languages.

| % of Noise | Count | Languages |
|---|---|---|
| [0,5] | 11 | abk, hye, kat, lit, ltz, sqi, tuk |
| (5,10] | 44 | afr, amh, ara, asm, aze, bak, bel, bod, bos, bul, cat, ces, deu, ell, epo, est, fas, fra, glg, heb, hrv, hun, ita, kaz, kor, mkd, mlt, mon, nep, nld, nor, pol, por, pus, ron, rus, sin, slk, slv, som, srp, swe, tat, tgk, tur, ukr, uzb, zh-CN |
| (10,15] | 22 | dan, eng, fil, fin, hat, hau, hin, ind, isl, jpn, kan, lav, lin, mal, mar, mlg, msa, oci, spa, swa, tam, vie |
| (15,20] | 10 | ben, cym, eus, fao, glv, guj, lao, tel, tha, urd |
| (20,30] | 6 | bre, ina, mya, nno, pan, yid |
| (30,40] | 3 | sna, sun, yor |
| (40,60] | 6 | ceb, jav, khm, mri, sco, snd |
| (50,80] | 5 | grn, haw, lat, san, war |

Table 15: The overall percentage of noise (music, silence, noise) detected in the different language splits of VL using an automatic tool.

Lastly, we note that another key issue in this corpus is the miscategorization of language in many speech utterances. In particular, we observed many English utterances that were labelled as belonging to other languages, with this issue being far more prevalent for low-resource languages. Since resolving this would require a powerful language identification tool to work well in all of the 107 languages present in this corpus, we decided to not filter or remove these utterances – but we do highlight that the number of utterances in low-resourced languages for this corpus is probably an over-estimation.

| Dataset | Full Name | Download URL |
|---|---|---|
| Aishell | Aishell [4] | https://www.openslr.org/33/ |
| Aishell-3 | AISHELL-3 [49] | https://www.openslr.org/93/ |
| B-TTS | BibleTTS [25] | http://www.openslr.org/129/ |
| Clovacall | ClovaCall [21] | https://github.com/clovaai/ClovaCall |
| CV | Common Voice version 11.0 [1] | https://commonvoice.mozilla.org/en/datasets |
| G-TTS | High quality TTS data for Javanese [50] | http://www.openslr.org/41/ |
| | High quality TTS data for Khmer [50] | http://www.openslr.org/42/ |
| | High quality TTS data for Nepali [50] | http://www.openslr.org/43/ |
| | High quality TTS data for Sundanese [50] | http://www.openslr.org/44/ |
| | High quality TTS data for four South African languages [54] | http://www.openslr.org/32/ |
| | High quality TTS data for Bengali languages [50] | https://www.openslr.org/37/ |
| IISc-MILE | IISc-MILE Tamil ASR Corpus [32, 33] | https://www.openslr.org/127/ |
| | IISc-MILE Kannada ASR Corpus [32, 33] | http://www.openslr.org/126/ |
| JVS | Japanese versatile speech [52] | https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus |
| Kokoro | Kokoro | https://github.com/kaiidams/Kokoro-Speech-Dataset |
| kosp2e | Korean Speech to English Translation Corpus [11] | https://github.com/warnikchow/kosp2e |
| MLS | Multilingual LibriSpeech [39] | http://www.openslr.org/94/ |
| MS | MediaSpeech [26] | https://www.openslr.org/108/ |
| Samrómur | Samrómur Unverified 22.07 [51] | https://www.openslr.org/128/ |
| THCHS-30 | THCHS-30 [56] | https://www.openslr.org/18/ |
| THUYG-20 | THUYG-20 [45] | https://www.openslr.org/22/ |
| VL | VoxLingua107 [53] | https://bark.phon.ioc.ee/voxlingua107/ |
| VP | VoxPopuli [55] | https://github.com/facebookresearch/voxpopuli/ |

Table 16: Downloading URLs for mHuBERT-147 data. All data was downloaded between November 2022 and February 2023.
IETF code Language Dataset # Hours
abk Abkhazian CV 34.0
VL 10.0
afr Afrikaans G-TTS 3.3
VL 100.0
amh Amharic VL 74.0
ara Arabic CV 61.0
VL 53.6
asm Assamese CV 1.0
VL 143.0
ast Asturian CV 0.01
aze Azerbaijani CV 0.05
VL 55.6
bak Bashkir CV 213.4
VL 54.7
bas Basaa CV 0.9
bel Belarusian CV 1,101.6
VL 126.4
ben Bengali CV 33.2
G-TTS 2.0
VL 45.9
bod Tibetan VL 92.4
bos Bosnian VL 97.7
bre Breton CV 4.4
VL 33.1
bul Bulgarian CV 4.6
VL 47.7
VP 773.3
cat Catalan CV 1,638.7
VL 80.3
ceb Cebuano VL 3.9
ces Czech CV 38.2
VL 62.4
VP 866.1
chv Chuvash CV 17.3
ckb Central Kurdish CV 90.5
cnh Hakha Chin CV 0.7
cym Welsh CV 100.6
VL 65.6
dan Danish CV 3.5
VL 25.0
VP 728.4
deu German CV 1,091.0
MLS 1,966.5
VL 36.3
VP 274.8
div Maldivian CV 32.4
ell Greek CV 13.0
VL 60.5
VP 573.2
eng English CV 2,213.9
MLS 44,659.7
VL 43.5
VP 293.6
epo Esperanto CV 1,355.6
VL 8.9
est Estonian CV 29.9
VL 34.7
VP 682.7
Table 17: (1/5) List of included languages, with corresponding amount of speech data per dataset.
IETF code Language Dataset # Hours
eus Basque CV 79.0
VL 25.5
ewe Ewe B-TTS 76.6
fao Faroese VL 56.4
fas Persian CV 312.4
VL 52.0
fil Tagalog VL 82.7
fin Finnish CV 5.0
VL 29.5
VP 816.1
fra French CV 836.0
MLS 1,076.6
VL 63.2
VP 290.8
fry Frisian CV 41.2
gle Irish CV 3.2
glg Galician CV 4.4
VL 66.0
glv Manx Gaelic VL 3.5
grn Guarani CV 0.4
VL 1.0
guj Gujarati VL 39.5
hat Haitian Creole VL 86.3
hau Hausa B-TTS 85.6
CV 2.3
VL 81.1
haw Hawaiian VL 5.4
heb Hebrew VL 89.7
hin Hindi CV 5.0
VL 73.2
hrv Croatian VL 109.4
VP 836.6
hsb Upper Sorbian CV 1.5
hun Hungarian CV 9.5
VL 68.8
VP 851.0
hye Armenian CV 1.0
VL 66.5
ibo Igbo CV 0.01
ina Interlingua CV 8.3
VL 2.1
ind Indonesian CV 20.6
VL 34.4
isl Icelandic Samrómur 2,088.3
VL 81.2
ita Italian CV 271.9
MLS 247.4
VL 46.6
VP 613.8
jav Javanese G-TTS 3.5
VL 24.4
jpn Japanese CV 37.0
JVS 26.4
Kokoro 60.0
VL 50.1
kab Kabyle CV 516.6
Table 18: (2/5) List of included languages, with corresponding amount of speech data per dataset.
IETF code Language Dataset # Hours
kan Kannada IISc-MILE 273.8
VL 39.9
kat Georgian CV 6.6
VL 93.4
kaz Kazakh CV 0.6
VL 72.6
khm Khmer G-TTS 4.0
VL 27.0
kin Kinyarwanda CV 1,982.7
kir Kyrgyz CV 32.6
kmr Northern Kurdish CV 45.8
kor Korean Clovacall 38.1
kosp2e 190.9
VL 71.5
lao Lao VL 36.1
lat Latin VL 30.9
lav Latvian CV 3.0
VL 37.0
VP 868.4
lin Lingala B-TTS 54.0
VL 77.8
lit Lithuanian CV 7.1
VL 78.4
VP 796.2
ltz Luxembourgish VL 71.8
lug Ganda CV 363.0
mal Malayalam CV 0.5
VL 42.5
mar Marathi CV 11.9
VL 74.7
mdf Moksha CV 0.2
mhr Meadow Mari CV 97.5
mkd Macedonian CV 0.2
VL 103.9
mlg Malagasy VL 94.5
mlt Maltese CV 3.9
VL 62.9
VP 818.1
mon Mongolian CV 6.6
VL 65.9
mri Māori VL 20.5
mrj Western Mari CV 6.5
msa Malay VL 71.8
mya Burmese VL 31.5
myv Erzya CV 1.1
nan-tw Taiwanese Hokkien CV 0.7
nep Nepali CV 0.01
G-TTS 2.8
VL 65.6
nld Dutch CV 72.1
MLS 1,554.2
VL 37.2
VP 277.0
nno Norwegian Nynorsk CV 0.4
VL 43.3
nor Norwegian VL 98.2
oci Occitan VL 13.7
Table 19: (3/5) List of included languages, with corresponding amount of speech data per dataset.
IETF code Language Dataset # Hours
ori Odia CV 0.8
pan Punjabi CV 1.0
VL 42.1
pol Polish CV 129.4
MLS 103.6
VL 76.2
VP 841.9
por Portuguese CV 102.3
MLS 161.0
VL 58.6
VP 851.6
pus Pashto VL 44.7
rm-sursilv Sursilvan CV 2.4
rm-vallader Vallader CV 1.2
ron Romanian CV 8.5
VL 59.0
VP 834.8
rus Russian CV 149.1
VL 67.5
sah Sakha CV 2.7
san Sanskrit VL 4.6
sat Santali CV 0.4
sco Scots VL 1.5
sin Sinhala VL 60.6
skr Saraiki CV 1.3
slk Slovak CV 12.9
VL 36.6
VP 644.5
slv Slovenian CV 7.6
VL 112.3
VP 832.1
sna Shona VL 21.5
snd Sindhi VL 48.2
som Somali VL 94.9
sot Southern Sotho G-TTS 3.2
spa Spanish CV 380.7
MLS 917.7
VL 33.9
VP 301.5
sqi Albanian VL 67.2
srd Sardinian CV 0.7
srp Serbian CV 0.8
VL 47.5
sun Sundanese G-TTS 2.1
VL 43.6
swa Swahili CV 304.2
VL 57.6
swe Swedish CV 29.5
VL 31.3
VP 827.4
tam Tamil CV 186.1
IISc-MILE 132.2
VL 44.1
tat Tatar CV 20.3
VL 93.4
tel Telugu VL 64.3
tgk Tajik VL 60.5
Table 20: (4/5) List of included languages, with corresponding amount of speech data per dataset.
IETF code Language Dataset # Hours
tha Thai CV 134.7
VL 50.8
tig Tigre CV 0.01
tpi Tok Pisin CV 3.3
tsn Tswana G-TTS 3.5
tuk Turkmen VL 81.8
tur Turkish CV 59.3
MS 10.0
VL 54.5
tw-akuapem Akwapem Twi B-TTS 59.8
tw-asante Asante Twi B-TTS 56.8
uig Uyghur CV 94.9
THUYG-20 20.7
ukr Ukrainian CV 52.5
VL 49.7
urd Urdu CV 38.8
VL 35.2
uzb Uzbek CV 69.7
VL 42.2
vie Vietnamese CV 3.8
VL 55.5
vot Votic CV 0.1
war Waray VL 3.9
xho Xhosa G-TTS 3.1
yid Yiddish VL 35.9
yor Yoruba B-TTS 24.8
VL 66.0
yue Cantonese CV 15.6
zh-CN Chinese (PRC) Aishell 151.1
Aishell-3 60.6
CV 104.6
THCHS-30 25.5
VL 40.9
zh-HK Chinese (Hong Kong) CV 89.5
zh-TW Chinese (Taiwan) CV 56.3
Total 90,429.5
Table 21: (5/5) List of included languages, with corresponding amount of speech data per dataset.
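In the per-language tables above, the IETF code and language name appear only on a language's first row; rows for additional datasets carry just the dataset abbreviation and the hours. The following is a minimal sketch, not part of the released pre-processing scripts, of how such rows could be read back and summed per language. The function names `parse_rows` and `hours_per_language` are hypothetical, and the sketch assumes that every dataset abbreviation is a single whitespace-free token and that the hours value is the last token on the row.

```python
from collections import defaultdict


def parse_rows(lines):
    """Yield (code, language, dataset, hours) from flattened table rows.

    Rows with only two tokens inherit the code and language of the
    most recent full row, mirroring the merged cells of the tables.
    """
    code = language = None
    for line in lines:
        tokens = line.split()
        # Skip blank lines and the grand-total row.
        if len(tokens) < 2 or tokens[0] == "Total":
            continue
        try:
            # Hours use a comma as thousands separator, e.g. "1,101.6".
            hours = float(tokens[-1].replace(",", ""))
        except ValueError:
            # Last token is not a number (e.g. a header row): skip it.
            continue
        dataset = tokens[-2]
        if len(tokens) > 2:
            # Full row: first token is the code, the tokens between the
            # code and the dataset form the (possibly multi-word) name.
            code, language = tokens[0], " ".join(tokens[1:-2])
        yield code, language, dataset, hours


def hours_per_language(lines):
    """Sum the hours of all datasets for each language code."""
    totals = defaultdict(float)
    for code, _language, _dataset, hours in parse_rows(lines):
        totals[code] += hours
    return dict(totals)


# A few rows copied from the tables above.
sample = [
    "IETF code Language Dataset # Hours",
    "abk Abkhazian CV 34.0",
    "VL 10.0",
    "bel Belarusian CV 1,101.6",
    "VL 126.4",
    "ckb Central Kurdish CV 90.5",
]
totals = hours_per_language(sample)
print(totals)
```

Keying the totals on the language code rather than the name keeps variants such as zh-CN, zh-HK and zh-TW separate, as in the tables.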