Affiliations: ¹NAVER LABS Europe - France; ²University of Edinburgh - UK

Training code: https://github.com/utter-project/fairseq
Pre-processing code: https://github.com/utter-project/mHuBERT-147-scripts
Model Weights: https://huggingface.co/utter-project/mHuBERT-147

SSL	# Params	Monolingual ASR (normal) CER ($\downarrow$)	Multilingual ASR (normal) CER ($\downarrow$)	Multilingual ASR (few-shot) CER ($\downarrow$)	LID (normal) ACC ($\uparrow$)	Multilingual ASR + LID (normal) ACC ($\uparrow$)	Multilingual ASR + LID (normal) CER ($\downarrow$)	Multilingual ASR + LID (few-shot) CER ($\downarrow$)	SUPERB_s ($\uparrow$)
MMS-1B	965M	33.3 / 25.7	21.3 / 18.1	30.2 / 30.8	84.8 / 86.1	73.3 / 74.8	26.0 / 25.5	25.4 / 24.8	983.5 (1st) / 948.1 (2nd)
mHuBERT-147	95M	34.2 / 26.3	23.6 / 22.0	33.2 / 32.9	85.3 / 91.0	81.4 / 90.0	26.2 / 22.1	34.9 / 33.5	949.8 (2nd) / 950.2 (1st)
MMS-300M	317M	33.8 / 30.5	28.7 / 24.0	36.5 / 36.5	62.3 / 84.3	71.9 / 74.3	31.5 / 30.0	30.9 / 29.2	824.9 (3rd) / 844.3
NWHC1	317M	39.5 / 30.5	28.9 / 21.5	41.4 / 38.6	67.1 / 87.4	77.1 / 90.6	28.8 / 21.5	40.3 / 38.2	774.4 / 876.9 (3rd)
NWHC2	317M	39.5 / 30.5	29.3 / 21.6	42.0 / 39.3	64.4 / 88.1	77.4 / 90.6	28.4 / 21.8	41.5 / 38.8	759.9 / 873.3
XLS-R-300M	317M	39.7 / 30.6	29.2 / 22.0	40.9 / 39.3	66.9 / 87.9	55.6 / 85.6	28.4 / 22.9	42.1 / 42.4	730.8 / 850.5
WavLabLM-large-MS	317M	40.5 / 32.8	37.8 / 31.9	43.8 / 42.8	71.7 / 81.1	70.8 / 80.0	37.0 / 32.2	43.4 / 41.2	707.5 / 740.9

Teaser figure: ML-SUPERB 10min/1h leaderboard. Updated SOTA scores for each metric are presented in bold. The first (1st), second (2nd) and third (3rd) best SUPERB scores are highlighted.

mHuBERT-147: A Compact Multilingual HuBERT Model

Abstract

We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.

1 Introduction

Self-supervised Learning (SSL) approaches for speech representation learning are the core foundation blocks of speech processing systems today. These models leverage large amounts of unlabeled speech data during training, and learn from different pretext tasks in order to build contextualized speech representations that can be leveraged for various downstream tasks. For English, several models have been proposed [38, 41, 3, 23, 8, 31], but most multilingual models available to the community are based on wav2vec 2.0 [13, 2, 40], with the only current exception being WavLabLM [10]. Meanwhile, the Hidden units BERT model (HuBERT, [23]) – where pre-training is performed in 2 or 3 iterations, and target training labels are externally obtained via k-means clustering – presents superior performance on the English SSL benchmark SUPERB [59], outperforming wav2vec 2.0. The English version of this model also shows decent cross-lingual adaptation capabilities on the multilingual benchmark ML-SUPERB [47]. Recently, HuBERT has also emerged as a popular choice for producing discrete speech units for multimodal LLMs [57, 6, 12].

However, despite recent efforts to reduce hardware costs [29, 30, 9], HuBERT’s superior performance comes with higher training costs – a characteristic that is further accentuated in multilingual settings requiring larger amounts of speech data. These cost demands arise from HuBERT’s multi-iteration training process, which contrasts with wav2vec 2.0’s single iteration training on unlabeled speech data alone. HuBERT requires high-dimensional feature extraction across the entire training dataset to generate discrete labels, along with a minimum of two model training and clustering steps, resulting in increased disk and CPU/GPU resource demands. We are aware of only small-scale multilingual HuBERT models that train using a single dataset: there is an mHuBERT trained on 3 languages [28]; and a HuBERT collection covering between 5 and 12 languages [18].

In contrast, in this work we tackle the challenge of training the first general-purpose massively multilingual HuBERT speech representation model. This model is trained using 90,430 hours of clean open-license data across 147 languages. To reduce pre-processing costs, we down-sample large, widely used speech datasets, hypothesizing that source diversity is more important than quantity. We also propose to replace the original HuBERT clustering implementation with faiss-based clustering [17], increasing label assignment speed by 5.2 times. Finally, for training, we employ a two-level multilingual up-sampling approach that factors in both linguistic and dataset diversity to increase overall multilingual performance.

After three training iterations, and with only 95M parameters, our compact mHuBERT-147 model outperforms multilingual wav2vec 2.0 models that are not only three times larger, but also trained with almost five times more data. It reaches respectively second and first place on the 10 min and 1 h leaderboards of the popular multilingual benchmark ML-SUPERB, with state-of-the-art (SOTA) scores for three out of four language identification (LID) tasks. Across all ML-SUPERB tasks and settings, mHuBERT-147 consistently surpasses XLS-R (300 M parameters; 436 K hours) – its most comparable multilingual wav2vec 2.0 model. It also demonstrates strong competitiveness against the much larger MMS (1 B parameters; 491 K hours). Complementary few-shot ASR evaluation on FLEURS-102 shows that our model is competitive with larger models, presenting impressive robustness at a compact size (70% fewer parameters). Our findings suggest that mHuBERT-147 is a promising compact model for speech processing pipelines, offering an unprecedented balance between high performance and parameter efficiency. We open-source all our scripts, manifest files and model weights.¹

¹ See Appendix Section A.5.

2 mHuBERT-147 data collection

We gathered 90,430 hours of speech from datasets with permissive licences in 147 languages. For this multilingual collection, our goal was to prioritize linguistic diversity over data quantity alone. Table 1 lists the datasets we use along with their licences. Figure 1 illustrates data distribution across languages. In total, our training set spans 19 language families (sorted in decreasing order of data quantity): Indo-European, Niger-Congo, Uralic, Afro-Asiatic, Constructed (Esperanto), Turkic, Dravidian, Sino-Tibetan, Austronesian, Koreanic, Kra-Dai, Japonic, Language isolate (Basque), Kartvelian, Austroasiatic, Mongolic, Northwest Caucasian, Creole and Tupian. Appendix Tables 17-21 list all included languages with amount of data per dataset.

Section 2.1 details the data pre-processing and filtering process. Section 2.2 discusses the overlap between the data we use and the data used by popular multilingual SSL models for speech representation learning.

Dataset	Full Names and References	# Languages	# Hours (filtered)	License
Aishells	Aishell [4] and AISHELL-3 [49]		212	Apache License 2.0
B-TTS	BibleTTS [25]		358	CC BY-SA 4.0
Clovacall	ClovaCall [21]			MIT
CV	Common Voice version 11.0 [1]		14,943	CC BY-SA 3.0
G-TTS	High quality TTS data for Javanese, Khmer, Nepali, Sundanese, and Bengali Languages [50]; High quality TTS data for four South African languages [54]			CC BY-SA 4.0
IISc-MILE	IISc-MILE Tamil and Kannada ASR Corpus [32, 33]		406	CC BY 2.0
JVS	Japanese versatile speech [52]			CC BY-SA 4.0
Kokoro	Kokoro Speech Dataset [24]			CC0
kosp2e	Korean Speech to English Translation Corpus [11]		191	CC0
MLS	Multilingual LibriSpeech [39]		50,687	CC BY 4.0
MS	MediaSpeech [26]			CC BY 4.0
Samrómur	Samrómur Unverified 22.07 [51]		2,088	CC BY 4.0
TH-data	THCHS-30 [56] and THUYG-20 [45, 46]			Apache License 2.0
VL	VoxLingua107 [53]	107	5,844	CC BY 4.0
VP	VoxPopuli [55]		15,494	CC0

Table 1: Datasets used for training mHuBERT-147, with the abbreviation used throughout the paper (left), the number of languages and hours (after filtering), and the license. Download URLs are listed in Appendix Table 16.

Figure 1: Speech amount per language in a logarithmic scale, with different levels of speech resourcefulness: $\geq$ 800 h (blue), $\geq$ 100 h (green), $\geq$ 50 h (orange), $\geq$ 10 h (red), $\leq$ 10 h (purple). Some languages are excluded from the plot. Best seen in color.

2.1 Dataset pre-processing and filtering

We follow the default HuBERT pre-processing guidelines [23]. For all datasets, the speech data is converted to 16-bit 16kHz WAV files, with a volume reduction of $5\%$ applied for datasets in which excessive clipping is observed during conversion. Data is filtered to the interval of $[2,30]$ s and any utterances outside this window are discarded. The exceptions are the following datasets, for which file concatenation is performed instead: B-TTS (Akwapen Twi, Asante Twi, Hausa, Yoruba), IISc-MILE (Kannada), and JVS (Japanese). In these cases, we concatenated files because the amount of speech available for training in those languages was already limited.
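The duration filter can be sketched as follows. This is only an illustration under stated assumptions: the manifest layout of `(path, num_samples)` pairs and the function name `filter_by_duration` are hypothetical and are not taken from the released pre-processing scripts; only the 16 kHz rate and the $[2,30]$ s window come from the text.

```python
SAMPLE_RATE = 16_000          # Hz, as stated in the text
MIN_SEC, MAX_SEC = 2.0, 30.0  # closed interval [2, 30] s from the text


def filter_by_duration(manifest, sample_rate=SAMPLE_RATE,
                       min_sec=MIN_SEC, max_sec=MAX_SEC):
    """Keep (path, num_samples) entries whose duration lies in [min_sec, max_sec].

    The manifest layout is a hypothetical one chosen for this sketch.
    """
    kept = []
    for path, num_samples in manifest:
        duration = num_samples / sample_rate
        if min_sec <= duration <= max_sec:
            kept.append((path, num_samples))
    return kept


# Hypothetical entries: 1 s (too short), 10 s (kept), 31 s (too long).
example = [("a.wav", 16_000), ("b.wav", 160_000), ("c.wav", 496_000)]
print(filter_by_duration(example))
```

The concatenation applied to the exception datasets is not covered by this sketch.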

For TTS/ASR/ST datasets, only the training set is used; validation and test sets are always excluded to prevent data contamination. The complete manifest list can be found in our HuggingFace repository.² We now detail dataset-specific pre-processing.³

² Manifest files available at: https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest
³ Pre-processing scripts available at: https://github.com/utter-project/mHuBERT-147-scripts/tree/main/01_dataset_preproc

• VL: Manual inspection revealed that some language splits contained a significant amount of music, noise, and silence-only files. To improve the quality of the data used by our models, and to accurately estimate the amount of speech data per language, we filtered the dataset using the inaSpeechSegmenter tool [16]. In total, we removed over 332K potentially noisy utterances (249K music, 83K noise/silence). More details are given in Appendix Section C.2.

• VP: For languages with ample representation in the multilingual collection (i.e., over 1,000h of speech, namely German, English, Spanish, French, and Dutch), we exclusively use the 10K subset of VP (2019 and 2020). For the other 18 languages, we use talks from 2017 to 2020 (100K VP split).

2.2 Data overlap with existing multilingual models

Table 2 shows the overlap in the training data of mHuBERT-147 and the multilingual wav2vec 2.0 [3] models: XLSR-53 [13], XLS-R [2] and MMS [40]. Looking at the table, we see that the model most similar to ours in terms of dataset overlap is XLS-R, with only BABEL not included in our training. This dataset is not freely available, and it comprises speech data in 17 low-resource African and Asian languages: Assamese, Bengali, Cantonese, Cebuano, Georgian, Haitian, Kazakh, Lao, Northern Kurdish, Pashto, Swahili, Tagalog, Tamil, Tok Pisin, Turkish, Vietnamese and Zulu. Of these, the only language we did not find an openly available dataset for was Zulu. Thus, mHuBERT-147 covers 127 of the 128 languages of XLS-R, while supporting 20 additional languages.

Dataset	# Langs	XLSR-53	XLS-R	MMS	mHuBERT-147
BABEL [19]	17	✓	✓	✓
CV v2	38	✓	✓	✓	✓
CV v6	60		✓	✓	✓
CV v9	89			✓	✓
CV v11	98				✓
MLS	8	✓	✓	✓	✓
VP	23		✓	✓	down-sampled
VL	107		✓	✓	down-sampled
MMS-lab-U [40]	1,362			✓
Aishells, B-TTS, Clovacall, G-TTS, IISc-MILE, JVS, Kokoro, kosp2e, MS, Samrómur, TH-data	23				✓

Table 2: Datasets included in the training of the multilingual models most comparable to the mHuBERT-147 model. Although we treat earlier versions of CV as being entirely contained in later versions, this is a simplification: some files might have been removed due to speakers' requests or down-votes.

3 mHuBERT-147 training

Our training follows the multi-iteration pre-training objective of HuBERT [23]. This approach leverages an external acoustic unit discovery model (k-means clustering) to provide frame-level targets for masked span prediction pre-training. For optimization, two cross-entropy losses are computed over masked ( $L_{m}$ ) and unmasked ( $L_{u}$ ) spans and combined with a weight $\psi$ as $L=\psi L_{m}+(1-\psi)L_{u}$ . $L_{m}$ and $L_{u}$ can both be defined using the generic loss function in Equation 1. Here, $f$ refers to the masked prediction model. $X^{\prime}=[x_{1},..,x_{T}]$ is the masked speech utterance of frame length $T$ . $M\subset{\{1,..,T\}}$ is the set of masked indices in $X^{\prime}$ for the masked loss ( $L_{m}$ ), and the set of unmasked indices otherwise ( $L_{u}$ ). $Z=[z_{1},..,z_{T}]$ are the discrete labels from clustering. Finally, $P_{f}$ produces the distribution over the target indices at each time-step $t$ . The authors of HuBERT highlight that higher values of $\psi$ result in a model analogous to language modeling, where the prediction model $f$ is forced to learn both the unmasked segments and the long-range temporal structure of the speech data.

$$L(f;X^{\prime},M,Z)=\sum_{t\in M}\log P_{f}(z_{t}\mid X^{\prime},t) \qquad (1)$$
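To make the weighting concrete, the following minimal sketch evaluates Equation 1 over masked and unmasked frames and combines the two terms with $\psi$. It is an illustration only: the function names and the toy log-probabilities are hypothetical, frames are 0-indexed rather than 1-indexed, and the sign follows Equation 1 as written (a cross-entropy loss used for training would minimise the negative of this sum).

```python
import math


def span_loss(log_probs, labels, indices):
    """Sum of log P_f(z_t | X', t) over the frame indices in `indices` (Equation 1)."""
    return sum(log_probs[t][labels[t]] for t in indices)


def hubert_loss(log_probs, labels, masked, psi):
    """L = psi * L_m + (1 - psi) * L_u, with L_u taken over the unmasked frames."""
    masked = set(masked)
    unmasked = [t for t in range(len(labels)) if t not in masked]
    l_m = span_loss(log_probs, labels, sorted(masked))
    l_u = span_loss(log_probs, labels, unmasked)
    return psi * l_m + (1.0 - psi) * l_u


# Toy example: T = 3 frames, 2 cluster labels, frame 1 is masked.
toy_log_probs = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.25), math.log(0.75)],
    [math.log(0.9), math.log(0.1)],
]
toy_labels = [0, 1, 0]
print(hubert_loss(toy_log_probs, toy_labels, masked=[1], psi=1.0))  # masked term only
print(hubert_loss(toy_log_probs, toy_labels, masked=[1], psi=0.0))  # unmasked term only
```

With $\psi=1$ only the masked frame contributes, which is the language-modeling-like regime described above.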

Next, we dive into how mHuBERT-147 differs from HuBERT training, including a) a more complex data up-sampling strategy (3.1), and b) a replacement of the original k-means implementation with the efficient faiss [17] Inverted File Index (IVF) – drastically increasing labeling speed (3.2). Finally, we discuss our experimental setup (3.3).

3.1 Two-level language-source up-sampling

To optimize for exposure to different languages and data sources during training, we employ a two-level up-sampling strategy during multilingual batching. Let $N$ be the total number of examples in the training set, with $n_{l}$ corresponding to the count for language $l$ . The sampling probability for $l$ is $P_{l}\propto\left(\frac{n_{l}}{N}\right)^{\alpha}$ , where $\alpha$ is a hyper-parameter in $[0,1]$ ; with $\alpha=1$ resulting in no up-sampling, and lower values resulting in higher probabilities for under-represented languages.

For each epoch, we sample $N$ times from the probability distribution $P_{l}$ . In this way we obtain a quantity $B_{l}$ of examples selected per language $l$ . We then sample the $B_{l}$ utterances by considering varied data sources (datasets). The probability of sampling an utterance of $l$ from data source $x$ is given by $P_{x}\propto\left(\frac{n_{l}(x)}{n_{l}}\right)^{\beta}$ , where $n_{l}(x)$ corresponds to the number of examples of language $l$ from data source $x$ , and $\beta$ is a hyper-parameter in $[0,1]$ . We sort the $N$ selected utterances by length before batching, to minimize random cropping.
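The two sampling levels can be sketched as below. This is a minimal illustration under assumptions, not the fairseq fork's implementation: the helper names and the toy counts are hypothetical, the sketch only returns how many utterances are drawn per (language, source) pair, and how individual utterances are picked within a source is left open because the text does not specify it.

```python
import random


def normalized_power(counts, exponent):
    """Return probabilities proportional to (count / total) ** exponent."""
    total = sum(counts.values())
    weights = {k: (c / total) ** exponent for k, c in counts.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


def two_level_sample(counts, alpha, beta, rng):
    """counts: {language: {source: n_examples}} -> {(language, source): n_sampled}."""
    per_lang = {lang: sum(src.values()) for lang, src in counts.items()}
    n_total = sum(per_lang.values())

    # Level 1: draw N languages from P_l to obtain the per-language budget B_l.
    p_lang = normalized_power(per_lang, alpha)
    langs = list(p_lang)
    draws = rng.choices(langs, weights=[p_lang[l] for l in langs], k=n_total)
    budget = {l: draws.count(l) for l in langs}

    # Level 2: split each budget B_l across the data sources of that language.
    sampled = {}
    for lang, b in budget.items():
        p_src = normalized_power(counts[lang], beta)
        sources = list(p_src)
        picks = rng.choices(sources, weights=[p_src[s] for s in sources], k=b)
        for s in sources:
            sampled[(lang, s)] = picks.count(s)
    return sampled


# Hypothetical toy corpus: one high-resource and one low-resource language.
toy_counts = {"en": {"A": 900, "B": 50}, "sw": {"A": 40, "B": 10}}
print(normalized_power({"en": 950, "sw": 50}, 0.7))
print(two_level_sample(toy_counts, alpha=0.7, beta=0.9, rng=random.Random(0)))
```

In the toy corpus the low-resource language holds 5% of the examples; with an exponent below 1 its sampling probability rises above 5%, which is the up-sampling effect described above.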

The most similar up-sampling strategy to ours is MMS – which considers (language, dataset) pairs as unique languages for up-sampling. However, their approach inflates the training time allocated to high-resource languages, as these tend to have more diverse data sources in the training set. Unlike them, we adopt a two-level technique to better suit our diverse dataset collection.

3.2 Efficient clustering with faiss indices

We use faiss-based [17] k-means clustering for faster label assignment. This library facilitates efficient similarity search and clustering of dense vectors. We cluster using an Inverted File Index (IVF) with the following setting, which we now detail: OPQM_D,IVFK_HNSW32,PQMx4fsr.⁴

⁴ More information can be found at: https://github.com/facebookresearch/faiss/wiki/The-index-factory

• “OPQM_D,..,PQMx4fsr” controls the RAM usage of the index. This option optimizes the vector representation by rotation and then projects the input vectors into dimension D. Product quantization (PQ) is then applied to hash the vectors into M 4-bit codes, resulting in the storage of M/2 bytes per vector. We use $M=16,D=64$ .

• “..,IVFK_HNSW32,..” denotes the indexing itself. IVF performs coarse quantization via an efficient implementation of k-means. This option uses Hierarchical Navigable Small Worlds (HNSW) graphs for cluster assignment, with 32 being the number of links per vertex. In practice, this greatly speeds up label assignment (5.2x faster than Hsu et al. [23]).
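As a small worked example of the setting above, the sketch below fills in the template and computes the PQ storage per vector. The helper names are hypothetical, and using $K=1000$ (the number of clusters reported in Section 3.3.1) for the IVF placeholder is an assumption of this sketch. In faiss, a string of this form is passed to `faiss.index_factory` together with the input dimensionality; the sketch deliberately stops short of calling the library.

```python
def faiss_factory_string(m, d, k, hnsw_links=32):
    """Fill in the 'OPQM_D,IVFK_HNSW32,PQMx4fsr' template from the text."""
    return f"OPQ{m}_{d},IVF{k}_HNSW{hnsw_links},PQ{m}x4fsr"


def pq_bytes_per_vector(m, bits_per_code=4):
    """M codes of 4 bits each occupy M/2 bytes."""
    return m * bits_per_code / 8


spec = faiss_factory_string(m=16, d=64, k=1000)
print(spec)                     # OPQ16_64,IVF1000_HNSW32,PQ16x4fsr
print(pq_bytes_per_vector(16))  # 8.0
```

With $M=16$, each indexed vector is therefore stored in 8 bytes of PQ codes, regardless of whether the input features have 39 or 768 dimensions.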

3.2.1 Faiss vs sklearn k-means

Our motivation for replacing the sklearn k-means with faiss comes from our struggle to scale the former to a multilingual setting with more data and a higher K. For instance, working with an 850 GB set of vectors of dimensionality 768, we were unable to finish the sklearn clustering step after four days, while the faiss implementation was able to produce an index in 40 h. Moreover, label application is 5.2 times faster with faiss indices than with sklearn clustering. This speed-up in label application is particularly relevant for our work, as we have a massive training set.

We verified that this new clustering setting does not impact downstream model quality by pretraining two baselines: a two-iteration English HuBERT following the original codebase, and a faiss-based equivalent model with the same number of centroids. Both pretrained models were then fine-tuned on the Librispeech [35] 100 h ASR, performing similarly on clean/other test sets. Appendix Section A.1 presents our detailed results and discussion.

3.3 Experimental setting

3.3.1 Faiss training

For clustering, we also sample data using the two-level up-sampling approach, to ensure alignment with the training data. We maximize the data quantity used for clustering by sampling with a RAM budget of 850 GB, which corresponds to over 6 M examples (20.8%) for iteration 1, and over 630 K examples (around 2%) for iterations 2 and 3. The reduced example count for later iterations is due to the features being high-dimensional ( $dim=768$ ) compared to iteration 1 MFCCs ( $dim=39$ ). We use $K=1000$ across all iterations.⁵

⁵ Code available at: https://github.com/utter-project/mHuBERT-147-scripts/tree/main/03_faiss_indices/

3.3.2 Pre-training setting

We build our codebase on top of fairseq [34], introducing data loading optimizations and a new multilingual batching setting (Section 3.1).⁶ We train HuBERT base models (95 M parameters) following the original recipe’s hyperparameters but with a larger batch size (2.8 M instead of 1.4 M frames), input normalization, and scaling to more updates: 2 M steps (approximately 31 epochs), instead of the original 400 K. For our two-level up-sampling strategy, we select $\alpha=0.7$ (language up-sampling) and $\beta=0.9$ (source up-sampling). Feature extraction for iterations 2 and 3 is performed using respectively the 6th and the 9th layer of the previous iteration's last checkpoint. Training one iteration requires 32x A100-80GB GPUs running in 4 parallel nodes for approximately 20 days. More details regarding training are presented in Appendix Section A.4.

⁶ Fork available at: https://github.com/utter-project/fairseq

3.3.3 Hardware considerations

mHuBERT-147 is trained on over 90 K hours of speech in 147 languages – considerably less than its most comparable multilingual model XLS-R (128 languages, 436 K hours), and the larger scale MMS (1,406 languages, 491 K hours). Indeed, data quantity is a major bottleneck for HuBERT, since the training process requires discrete labels for the entire training dataset which, in turn, are produced from high-dimensional features. Extracting these features in a typical multi-iteration process can incur prohibitive expenses for massive datasets, in terms of both storage requirements and pre-processing times. For instance, the features we extracted for mHuBERT-147 occupied 48 TB of storage per iteration. We estimate that for the training set of XLS-R, this would take about 270 TB. While storage requirements can be reduced by looping the feature extraction and labeling processes, this would increase pre-processing times. Appendix Section A.2 details hardware requirements for the different pre-processing and training steps.

4 ML-SUPERB evaluation

Table 3: ML-SUPERB 10min/1h results. Current SOTA [48] is shown in the top portion of the table, our submission is shown in the middle, and other relevant multilingual models are presented below it. Updated SOTA scores for each metric are presented in bold. The first (1st), second (2nd) and third (3rd) best SUPERB scores are highlighted.

SSL	# Params	Monolingual ASR (normal) CER ($\downarrow$)	Multilingual ASR (normal) CER ($\downarrow$)	Multilingual ASR (few-shot) CER ($\downarrow$)	LID (normal) ACC ($\uparrow$)	Multilingual ASR + LID (normal) ACC ($\uparrow$)	Multilingual ASR + LID (normal) CER ($\downarrow$)	Multilingual ASR + LID (few-shot) CER ($\downarrow$)	SUPERB_s ($\uparrow$)
MMS-1B	965M	33.3 / 25.7	21.3 / 18.1	30.2 / 30.8	84.8 / 86.1	73.3 / 74.8	26.0 / 25.5	25.4 / 24.8	983.5 (1st) / 948.1 (2nd)
NWHC1	317M	39.5 / 30.5	28.9 / 21.5	41.4 / 38.6	67.1 / 87.4	77.1 / 90.6	28.8 / 21.5	40.3 / 38.2	774.4 / 876.9 (3rd)
NWHC2	317M	39.5 / 30.5	29.3 / 21.6	42.0 / 39.3	64.4 / 88.1	77.4 / 90.6	28.4 / 21.8	41.5 / 38.8	759.9 / 873.3
mHuBERT-147-3rd	95M	34.2 / 26.3	23.6 / 22.0	33.2 / 32.9	85.3 / 91.0	81.4 / 90.0	26.2 / 22.1	34.9 / 33.5	949.8 (2nd) / 950.2 (1st)
mHuBERT-147-2nd	95M	35.9 / 27.6	25.4 / 22.5	34.2 / 33.8	74.8 / 90.1	81.0 / 89.0	26.3 / 23.6	33.9 / 34.4	895.0 / 925.7
MMS-300M	317M	33.8 / 30.5	28.7 / 24.0	36.5 / 36.5	62.3 / 84.3	71.9 / 74.3	31.5 / 30.0	30.9 / 29.2	824.9 (3rd) / 844.3
XLS-R-300M	317M	39.7 / 30.6	29.2 / 22.0	40.9 / 39.3	66.9 / 87.9	55.6 / 85.6	28.4 / 22.9	42.1 / 42.4	730.8 / 850.5
WavLabLM-large-MS	317M	40.5 / 32.8	37.8 / 31.9	43.8 / 42.8	71.7 / 81.1	70.8 / 80.0	37.0 / 32.2	43.4 / 41.2	707.5 / 740.9

To evaluate the quality of the multilingual representations learned by mHuBERT-147, we use the ML-SUPERB benchmark [47]. This benchmark comprises two settings (10min and 1h) and four tasks: monolingual ASR, multilingual ASR, LID, and joint multilingual ASR and LID. The monolingual setting comprises 13 (language, domain) pairs, while the multilingual one has 240 pairs across 143 languages in total – of which 123 and 20 constitute the normal (10min/1h per language) and few-shot (5 utterances per language) evaluation settings respectively.

Setup.

The downstream architecture consists of learnable weights over the frozen SSL features, a CNN for reducing feature dimensionality by a half, and two Transformer layers ( $dim=256$ ; 8 attention heads). In line with the official guidelines,⁷ we only tune the learning rates (best of $1e-3$, $1e-4$ and $1e-5$ ) on the validation set.⁸ Due to the high compute costs of evaluating on this benchmark, we do not retrain the other submissions we compare against and instead reuse the leaderboard scores from Shi et al. [48]. Therefore, we are unfortunately unable to provide confidence intervals, as these were not originally requested by the benchmark organizers. We also recompute the global SUPERB scores for all models following Shi et al. [47], since this metric leverages SOTA scores for normalization, and our model achieves new SOTA on three tasks (bold values in Table 3).⁹

⁷ Recipe at: https://github.com/espnet/espnet/tree/master/egs2/ml_superb/asr1
⁸ Appendix B.1 presents more information regarding optimization.
⁹ Our SUPERB score calculator is available at: https://github.com/utter-project/mHuBERT-147-scripts/blob/main/06_evaluation/mlsuperb/compute_superb_score.py
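The first component of the downstream architecture, learnable weights over the frozen SSL features, can be sketched as a weighted sum over layer outputs. This is an illustration under assumptions: normalising the per-layer scalars with a softmax is a common choice for such probes but is not stated in the text, and the function names and toy features are hypothetical. The CNN and Transformer layers are not covered.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def weighted_layer_sum(layer_features, layer_logits):
    """Combine frozen per-layer features with one learnable scalar per layer.

    layer_features: list over layers of [T][dim] frame features.
    layer_logits:   one scalar per layer (assumed softmax-normalised here).
    """
    weights = softmax(layer_logits)
    n_frames = len(layer_features[0])
    dim = len(layer_features[0][0])
    return [
        [sum(w * layer[t][j] for w, layer in zip(weights, layer_features))
         for j in range(dim)]
        for t in range(n_frames)
    ]


# Toy example: 2 layers, 1 frame, 2 feature dimensions, equal logits.
toy_layers = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(weighted_layer_sum(toy_layers, [0.0, 0.0]))  # [[0.5, 0.5]]
```

Only the per-layer scalars would be trained in this component; the SSL features themselves stay frozen, as stated above.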

Results.

Table 3 presents results for the 10min/1h settings. The current SOTA models from Shi et al. [48] are shown at the top (NWHC models are variants of MMS-300M). Other relevant multilingual SSL models are displayed in the bottom portion: MMS-300M, XLS-R-300M, and the best WavLabLM model from Chen et al. [10] (136 languages; 40 K hours). We highlight that although XLS-R and MMS are referred to in the literature as “300 M”, the correct parameter count is 317 M. We omit XLSR-53 results as these are consistently worse than XLS-R [47]. We present results for both our second (2nd) and third (3rd) mHuBERT-147 iterations.

Looking at the bottom rows, we notice that mHuBERT-147 outperforms the XLS-R and WavLabLM models across all tasks starting from its 2nd iteration. Compared to MMS-300M, both the 2nd and 3rd iterations of our model are only surpassed in three tasks – few-shot ASR+LID 10min/1h, and monolingual ASR 10min – and by a small margin. These results highlight that the mHuBERT-147 training approach produces effective output starting from the 2nd iteration. They also motivate conducting a 3rd iteration: we observe further improvement between the mHuBERT-147-2nd and mHuBERT-147-3rd scores on nearly all metrics, the only exception in Table 3 being the few-shot ASR+LID 10min CER (33.9 vs. 34.9).

Focusing on our best mHuBERT-147 model, and comparing it against the current SOTA (top rows), we see that this model is again very competitive, despite being trained on much less data. Our compact mHuBERT-147-3rd reaches the second position of the leaderboard for the 10min setting, where it is outperformed only by a model ten times larger (MMS-1B), and the first position for the 1h setting. We also set new SOTA ACC scores for LID 10min, LID 1h and Multilingual ASR+LID 10min, while being the only 95 M parameter model at the top of the leaderboard.

5 FLEURS-102 evaluation

We complement the ML-SUPERB evaluation by training monolingual ASR models on the FLEURS-102 dataset [14], competing with XLS-R and MMS (300M/1B). In this full fine-tuning few-shot setting, we simply add a linear projection to the target vocabulary on top of the pre-trained stack, optimizing the resulting model for ASR using approximately 10 hours of speech. Thus, unlike the experiments in the previous section, larger models will now have the benefit of having more parameters for adaptation. With these experiments, we want to illustrate that even in this unfavorable setting, mHuBERT-147 can still be competitive, while being faster at fine-tuning and inference time.

Setup.

We implement monolingual ASR using the transformers library [60]. All models were trained for 100 epochs in fp32 using V100-32GB/A100-80GB GPUs. Since individual language optimization would be prohibitively costly for this dataset, we select a subset of 29 languages, covering the different geographic groups, and optimize parameters using XLS-R. We use $1e-5$ as learning rate, warm-up ratio of $0.1$ and dropout of $0.1$ (300 M and 1 B) or $0.3$ (95 M). The increased dropout for the latter is due to the ASR models being considerably smaller (70% fewer parameters). We train three models per language with different seeds, adding up to four runs in cases of high variability in scores ( $\geq 20$ CER). We apply MMS transcript normalization [40], reporting CER averages over the two best runs.
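The reporting protocol (CER averaged over the two best runs) can be sketched as follows. This is an illustration only: the function names are hypothetical, computing CER at corpus level (total edits over total reference characters) is an assumption of this sketch, and the MMS transcript normalization step is not reproduced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programme)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Character error rate in percent: total edits / total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return 100.0 * edits / chars


def mean_of_best(run_scores, n_best=2):
    """Average the n_best lowest CERs obtained across seeds."""
    best = sorted(run_scores)[:n_best]
    return sum(best) / len(best)


print(cer(["hello"], ["hallo"]))         # 20.0
print(mean_of_best([12.0, 10.0, 45.0]))  # 11.0
```

Averaging only the best runs limits the influence of a seed that fails to converge, which matters in this few-shot setting.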

Results.

Table 4 presents results grouped by FLEURS-102 geographic groups: Western Europe (WE); Eastern Europe (EE); Central-Asia/Middle-East/North-Africa (CMN); Sub-Saharan Africa (SSA); South-Asia (SA); South-East Asia (SEA); Isolates (CJK). Despite its small size, mHuBERT-147 achieves the best average over 102 languages in this setting, mainly due to its superior performance in the CMN, SA, and SEA groups. Further investigation reveals that this superior performance is due to its greater robustness to few-shot adaptation compared to other models. For instance, compared to MMS-1B (see Appendix Table 11), mHuBERT-147 consistently shows higher CER averages, likely due to its smaller capacity. However, MMS-1B fails to achieve useful transcription (CER $>90$ ) for 16 languages, while mHuBERT-147 only fails for half of these.¹⁰ In other words, mHuBERT-147's overall superior performance is due to its representations converging more consistently in few-shot settings than those of other models. Appendix Section B.2 presents more information regarding the FLEURS-102 results.

¹⁰ Removing the aforementioned 16 languages from the computed average, we find that MMS-1B indeed produces the best FLEURS-102 scores (MMS-1B: 8.2, mHuBERT-147: 13.3).

SSL	General Avg (102)	WE (25)	EE (16)	CMN (12)	SSA (20)	SA (14)	SEA (11)	CJK (4)
MMS-1B	22.3	17.4	11.0	37.8	23.3	27.7	25.9	17.8
MMS-300M	24.9	19.5	12.5	39.8	24.8	29.1	29.5	35.8
XLS-R-300M	24.5	18.7	11.8	39.2	24.8	29.7	29.5	33.4
mHuBERT-147	21.1	18.7	15.3	23.4	22.7	25.5	19.8	31.5

Table 4: FLEURS-102 CER ( $\downarrow$ ) geographic group averages, with number of languages between parentheses.
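As a consistency check on Table 4, the general average can be recomputed as the language-count-weighted mean of the group averages. The sketch below assumes that the group columns follow the order in which the groups are introduced in the text (WE, EE, CMN, SSA, SA, SEA, CJK); the variable and function names are hypothetical. Because the group averages are themselves rounded, the recomputed values agree with the reported ones only to one decimal place.

```python
# Number of languages per geographic group (sums to 102).
GROUP_SIZES = {"WE": 25, "EE": 16, "CMN": 12, "SSA": 20, "SA": 14, "SEA": 11, "CJK": 4}

# Group CER averages from Table 4, in the same order as GROUP_SIZES.
GROUP_CER = {
    "MMS-1B":      [17.4, 11.0, 37.8, 23.3, 27.7, 25.9, 17.8],
    "MMS-300M":    [19.5, 12.5, 39.8, 24.8, 29.1, 29.5, 35.8],
    "XLS-R-300M":  [18.7, 11.8, 39.2, 24.8, 29.7, 29.5, 33.4],
    "mHuBERT-147": [18.7, 15.3, 23.4, 22.7, 25.5, 19.8, 31.5],
}

# General averages as reported in Table 4.
REPORTED_AVG = {"MMS-1B": 22.3, "MMS-300M": 24.9, "XLS-R-300M": 24.5, "mHuBERT-147": 21.1}


def weighted_average(group_cers, sizes=GROUP_SIZES):
    """Mean over languages, recovered from per-group means and group sizes."""
    total = sum(sizes.values())
    return sum(c * n for c, n in zip(group_cers, sizes.values())) / total


for model, cers in GROUP_CER.items():
    print(model, round(weighted_average(cers), 1), REPORTED_AVG[model])
```

The check also shows how much the 12-language CMN group contributes to the gap between mHuBERT-147 and the other models.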

Training and Inference efficiency.

We measured fine-tuning (1 epoch) and inference efficiency (test set) across three runs and two languages (English and Kannada) using an RTX 3090-24GB in exclusive node execution mode. We find that mHuBERT-147, which has 70% fewer parameters, is respectively 1.8 and 3 times faster to fine-tune on average than 300M and 1B models. For test inference, our model is respectively 1.4 and 2 times faster on average than 300M and 1B models.

Overall, our results highlight our model as a compact but powerful solution for multilingual speech processing applications. Appendix Section B.3 presents complementary results for the ESB benchmark.

6 Conclusion

In this paper we presented the first general-purpose multilingual HuBERT model. For training, we prioritized data quality over quantity – selecting diverse data sources and filtering noisy data, and defining a two-level multilingual batching up-sampling approach that considers both language and dataset diversity. For clustering, we leveraged faiss-optimized clustering for decreasing the label assignment costs. Despite being trained with considerably less data than popular multilingual models, and being a compact model, mHuBERT-147 is competitive with larger multilingual SSL models, reaching respectively second and first positions at ML-SUPERB 10min/1h leaderboards, and setting new SOTA for three LID metrics. Complementary ASR results on FLEURS-102 suggest that mHuBERT-147 is a promising model for multilingual speech downstream tasks, offering an unprecedented balance between high performance and parameter efficiency.

7 Acknowledgments

We thank the ML-SUPERB authors for providing us with assistance and clarification during the evaluation of our models. We would also like to thank our NAVER LABS Europe colleagues Caroline Brun and Salah Aït-Mokhtar for their assistance with the FLEURS-102 experiments.

This work was funded by the European Union’s Horizon Europe (HE) Research and Innovation programme under Grant Agreement No 101070631 and from the UK Research and Innovation (UKRI) under the UK government’s HE funding grant No 10039436. This document is licensed under CC-BY-4.0.


References

Ardila et al. [2020] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber. Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4211–4215, 2020.
Babu et al. [2022] Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. In Proc. Interspeech 2022, pages 2278–2282, 2022.
Baevski et al. [2020] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.
Bu et al. [2017] Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pages 1–5, 2017.
Carletta [2007] Jean Carletta. Unleashing the killer corpus: experiences in creating the multi-everything ami meeting corpus. Language Resources and Evaluation, 41:181–190, 2007.
Chang et al. [2023] Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, et al. Exploring speech recognition, translation, and understanding with discrete speech units: A comparative study. arXiv preprint arXiv:2309.15800, 2023.
Chen et al. [2021] Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909, 2021.
Chen et al. [2022] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, and Xiong Xiao. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022.
Chen et al. [2023a] William Chen, Xuankai Chang, Yifan Peng, Zhaoheng Ni, Soumi Maiti, and Shinji Watanabe. Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute. In Proc. INTERSPEECH 2023, pages 4404–4408, 2023a.
Chen et al. [2023b] William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, and Shinji Watanabe. Joint prediction and denoising for large-scale multilingual self-supervised learning. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE, 2023b.
Cho et al. [2021] Won Ik Cho, Seok Min Kim, Hyunchang Cho, and Nam Soo Kim. kosp2e: Korean Speech to English Translation Corpus. In Proc. Interspeech 2021, pages 3705–3709, 2021.
Chou et al. [2023] Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, and Michael Auli. Toward joint language modeling for speech units and text. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6582–6593, Singapore, 2023. Association for Computational Linguistics.
Conneau et al. [2021] Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. Unsupervised Cross-Lingual Representation Learning for Speech Recognition. In Proc. Interspeech 2021, pages 2426–2430, 2021.
Conneau et al. [2023] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. Fleurs: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE, 2023.
Del Rio et al. [2022] Miguel Del Rio, Peter Ha, Quinten McNamara, Corey Miller, and Shipra Chandra. Earnings-22: A practical benchmark for accents in the wild. arXiv preprint arXiv:2203.15591, 2022.
Doukhan et al. [2018] David Doukhan, Jean Carrive, Félicien Vallet, Anthony Larcher, and Sylvain Meignier. An open-source speaker gender detection framework for monitoring gender equality. In Acoustics Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018.
Douze et al. [2024] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. 2024.
Duquenne et al. [2023] Paul-Ambroise Duquenne, Hongyu Gong, Ning Dong, Jingfei Du, Ann Lee, Vedanuj Goswami, Changhan Wang, Juan Pino, Benoît Sagot, and Holger Schwenk. SpeechMatrix: A large-scale mined corpus of multilingual speech-to-speech translations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16251–16269, Toronto, Canada, 2023. Association for Computational Linguistics.
Gales et al. [2014] Mark JF Gales, Kate M Knill, Anton Ragni, and Shakti P Rath. Speech recognition and keyword spotting for low-resource languages: Babel project research at cued. In Fourth International workshop on spoken language technologies for under-resourced languages (SLTU-2014), pages 16–23. International Speech Communication Association (ISCA), 2014.
Gandhi et al. [2022] Sanchit Gandhi, Patrick Von Platen, and Alexander M Rush. Esb: A benchmark for multi-domain end-to-end speech recognition. arXiv preprint arXiv:2210.13352, 2022.
Ha et al. [2020] Jung-Woo Ha, Kihyun Nam, Jingu Kang, Sang-Woo Lee, Sohee Yang, Hyunhoon Jung, Hyeji Kim, Eunmi Kim, Soojin Kim, Hyun Ah Kim, Kyoungtae Doh, Chan Kyu Lee, Nako Sung, and Sunghun Kim. ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers. In Proc. Interspeech 2020, pages 409–413, 2020.
Hernandez et al. [2018] François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Esteve. Ted-lium 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer: 20th International Conference, SPECOM 2018, Leipzig, Germany, September 18–22, 2018, Proceedings 20, pages 198–208. Springer, 2018.
Hsu et al. [2021] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
Iida [2022] Katsuya Iida. Kokoro Speech Dataset. https://github.com/kaiidams/Kokoro-Speech-Dataset, 2022. [Online; accessed 21-Feb-2024].
Meyer et al. [2022] Josh Meyer, David Adelani, Edresson Casanova, Alp Öktem, Daniel Whitenack, Julian Weber, Salomon Kabongo Kabenamualu, Elizabeth Salesky, Iroro Orife, Colin Leong, Perez Ogayo, Chris Chinenye Emezue, Jonathan Mukiibi, Salomey Osei, Apelete Agbolo, Victor Akinode, Bernard Opoku, Olanrewaju Samuel, Jesujoba Alabi, and Shamsuddeen Hassan Muhammad. BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus. In Proc. Interspeech 2022, pages 2383–2387, 2022.
Kolobov et al. [2021] Rostislav Kolobov, Olga Okhapkina, Andrey Platunov, Olga Omelchishina, Roman Bedyakin, Vyacheslav Moshkin, Dmitry Menshikov, and Nikolay Mikhaylovskiy. Mediaspeech: Multilanguage asr benchmark and dataset, 2021.
Lagos and Calapodescu [2024] Nikolaos Lagos and Ioan Calapodescu. Unsupervised multi-domain data selection for asr fine-tuning. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 10711–10715, 2024.
Lee et al. [2022] Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, and Wei-Ning Hsu. Textless speech-to-speech translation on real data. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 860–872, Seattle, United States, 2022. Association for Computational Linguistics.
Li et al. [2022] Pengcheng Li, Genshun Wan, Fenglin Ding, Hang Chen, Jianqing Gao, Jia Pan, and Cong Liu. Improved speech pre-training with supervision-enhanced acoustic unit. arXiv preprint arXiv:2212.03482, 2022.
Lin et al. [2023] Tzu-Quan Lin, Hung-yi Lee, and Hao Tang. Melhubert: A simplified hubert on mel spectrograms. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE, 2023.
Liu et al. [2020] Andy T Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, and Hung-yi Lee. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6419–6423. IEEE, 2020.
Madhavaraj Ayyavu et al. [2022a] Madhavaraj Ayyavu, Bharathi Pilar, and A. G. Ramakrishnan. Subword dictionary learning and segmentation techniques for automatic speech recognition in tamil and kannada, 2022a.
Madhavaraj Ayyavu et al. [2022b] Madhavaraj Ayyavu, Bharathi Pilar, and A. G. Ramakrishnan. Knowledge-driven subword grammar modeling for automatic speech recognition in tamil and kannada, 2022b.
Ott et al. [2019] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
Panayotov et al. [2015a] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015a.
Panayotov et al. [2015b] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015b.
Parcollet et al. [2024] Titouan Parcollet, Ha Nguyen, Solène Evain, Marcely Zanon Boito, Adrien Pupier, Salima Mdhaffar, Hang Le, Sina Alisamir, Natalia Tomashenko, Marco Dinarelli, Shucong Zhang, Alexandre Allauzen, Maximin Coavoux, Yannick Estève, Mickael Rouvier, Jerôme Goulian, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, and Laurent Besacier. Lebenchmark 2.0: A standardized, replicable and enhanced framework for self-supervised representations of french speech. Computer Speech and Language, 86:101622, 2024.
Pascual et al. [2019] Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, and Yoshua Bengio. Learning problem-agnostic speech representations from multiple self-supervised tasks. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, pages 161–165, 2019.
Pratap et al. [2020] Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. MLS: A Large-Scale Multilingual Dataset for Speech Research. In Proc. Interspeech 2020, pages 2757–2761, 2020.
Pratap et al. [2023] Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al. Scaling speech technology to 1,000+ languages. arXiv preprint arXiv:2305.13516, 2023.
Ravanelli et al. [2020] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, João Monteiro, Jan Trmal, and Yoshua Bengio. Multi-task self-supervised learning for robust speech recognition. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6989–6993, 2020.
Ravanelli et al. [2021] Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, and Yoshua Bengio. SpeechBrain: A general-purpose speech toolkit, 2021. arXiv:2106.04624.
Renals et al. [2007] Steve Renals, Thomas Hain, and Hervé Bourlard. Recognition and understanding of meetings the ami and amida projects. In 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pages 238–247. IEEE, 2007.
Rivière et al. [2020] Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, and Emmanuel Dupoux. Unsupervised pretraining transfers well across languages. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7414–7418, 2020.
Roze et al. [2015] Askar Roze, Shi Yin, Zhiyong Zhang, Dong Wang, and Askar Hamdulla. Thugy20: A free uyghur speech database. In NCMMSC’15, 2015.
Rozi et al. [2015] Askar Rozi, Dong Wang, and Zhiyong Zhang. An open/free database and benchmark for uyghur speaker recognition. In O-COCOSDA’15, 2015.
Shi et al. [2023a] Jiatong Shi, Dan Berrebbi, William Chen, En-Pei Hu, Wei-Ping Huang, Ho-Lam Chung, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, and Shinji Watanabe. ML-SUPERB: Multilingual Speech Universal PERformance Benchmark. In Proc. INTERSPEECH 2023, pages 884–888, 2023a.
Shi et al. [2023b] Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, et al. Findings of the 2023 ml-superb challenge: Pre-training and evaluation over more languages and beyond. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE, 2023b.
Shi et al. [2015] Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. Aishell-3: A multi-speaker mandarin tts corpus and the baselines. 2015.
Sodimana et al. [2018] Keshan Sodimana, Knot Pipatsrisawat, Linne Ha, Martin Jansche, Oddur Kjartansson, Pasindu De Silva, and Supheakmungkol Sarin. A Step-by-Step Process for Building TTS Voices Using Open Source Data and Framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese. In Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU), pages 66–70, Gurugram, India, 2018.
Hedström et al. [2022] Staffan Hedström, Ragnheiður Þórhallsdóttir, David Erik Mollberg, Smári Freyr Guðmundsson, Ólafur Helgi Jónsson, Sunneva Þorsteinsdóttir, Judy Y. Fong, Eydís Huld Magnúsdóttir, and Jon Gudnason. Samrómur Unverified 22.07. Reykjavik University: Language and Voice Lab, 2022.
Takamichi et al. [2019] Shinnosuke Takamichi, Kentaro Mitsui, Yuki Saito, Tomoki Koriyama, Naoko Tanji, and Hiroshi Saruwatari. Jvs corpus: free japanese multi-speaker voice corpus. arXiv preprint arXiv:1908.06248, 2019.
Valk and Alumäe [2021] Jörgen Valk and Tanel Alumäe. VoxLingua107: a dataset for spoken language recognition. In Proc. IEEE SLT Workshop, 2021.
van Niekerk et al. [2017] Daniel van Niekerk, Charl van Heerden, Marelie Davel, Neil Kleynhans, Oddur Kjartansson, Martin Jansche, and Linne Ha. Rapid development of TTS corpora for four South African languages. In Proc. Interspeech 2017, pages 2178–2182, Stockholm, Sweden, 2017.
Wang et al. [2021] Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 993–1003, Online, 2021. Association for Computational Linguistics.
Wang et al. [2015] Dong Wang, Xuewei Zhang, and Zhiyong Zhang. Thchs-30 : A free chinese speech corpus. 2015.
Wang et al. [2023] Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, and Furu Wei. Viola: Unified codec language models for speech recognition, synthesis, and translation. arXiv preprint arXiv:2305.16107, 2023.
Watanabe et al. [2018] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. ESPnet: End-to-end speech processing toolkit. In Proceedings of Interspeech, pages 2207–2211, 2018.
Yang et al. [2021] Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee. SUPERB: Speech Processing Universal PERformance Benchmark. In Proc. Interspeech 2021, pages 1194–1198, 2021.
Wolf et al. [2020] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, 2020. Association for Computational Linguistics.

Appendix A Pre-training experiments

A.1 Faiss-based clustering

To evaluate our hypothesis that the mini-batch k-means proposed in the original HuBERT recipe can be replaced with the more efficient faiss IVF clustering (see Section 3.2), we trained two HuBERT base models [23] from scratch: one with k-means clustering and the other with faiss indices. We follow the recipe available in the fairseq library [34] (available at: https://github.com/facebookresearch/fairseq/blob/main/examples/hubert/README.md) for training the HuBERT models. The models are trained for two iterations on the LibriSpeech dataset [36]. We use the same number of centroids as the original recipe: $K=100$ in the first iteration, and $K=500$ in the second (model hyper-parameters are available at: https://github.com/facebookresearch/fairseq/blob/main/examples/hubert/config/pretrain/hubert_base_librispeech.yaml).

Clustering performance.

We find that, in this setting, both approaches take the same time to train (2.33 hours) when using the LibriSpeech train split with $K=500$ in the second iteration. However, as mentioned previously (Section 3.2.1), training a sklearn-based k-means with more data (850 GB) and a higher K ( $K=1000$ ) had still not finished after four days. For faiss, the same setting took approximately 40 hours.

Labeling performance.

Label assignment with sklearn takes 1.8 min per 10 h of speech, whereas faiss labels the same amount of speech data in only 21 s. This makes the sklearn approach 5.2 times slower. The speed-up obtained with faiss is mainly due to the Hierarchical Navigable Small World (HNSW) optimized graph search algorithm.

LibriSpeech Results.

We train ASR models on the 100 h split from LibriSpeech, using the Hugging Face transformers library [60]. We train for 30 epochs in fp16 precision, adding a linear projection to the target vocabulary on top of the pre-trained stacks. We explore two fine-tuning approaches: all (wherein all weights are updated) or weighted (wherein only downstream layers and a weighted sum over the pre-trained representations are updated). We tune the learning rate for each model (best of $10^{-2}$, $10^{-3}$, $10^{-4}$ and $10^{-5}$), selecting the best result using valid-clean. We train models twice, varying the random seed, and present average results. Table 5 presents results for the original HuBERT model (original; available at: https://huggingface.co/facebook/hubert-base-ls960), our reproduction of the same model (sklearn), and the faiss-based clustering version (faiss).
We observe that both models trained from scratch (sklearn and faiss) present a slight decrease in performance compared to the original pre-trained model. We believe this could be due to differences in data selection for the clustering steps, which is performed at random using 20% of the training data. Our models are also trained using the random crop batching approach, while the original implementation might have used padding (the shared recipe code mentions that “v1” HuBERT used padding, but we could not find information in the paper regarding this). Finally, focusing on models trained under the same settings, we observe that performance for sklearn and faiss models is very similar. This suggests that faiss IVFs can be leveraged for HuBERT training without performance degradation. In Appendix A.4.2 we show that, in multilingual settings and with a larger K for faiss, sklearn and faiss clustering also yield equally performing speech representation models.
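To make the weighted fine-tuning setting more concrete, the sketch below mixes the hidden states of several encoder layers with learnable scalar weights. It assumes the weights are softmax-normalized, as is common in SUPERB-style probing; the text above only specifies "a weighted sum over the pre-trained representations", so the normalization and the names `softmax` and `weighted_layer_sum` are illustrative assumptions rather than the exact setup used in our experiments.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_outputs, layer_logits):
    """Mix per-layer representations into a single sequence.

    layer_outputs: list of L layers, each a list of T frames,
                   each frame a list of D floats.
    layer_logits:  list of L learnable scalars (assumed softmax-normalized).
    """
    weights = softmax(layer_logits)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    mixed = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                mixed[t][d] += weight * layer[t][d]
    return mixed


# Toy example: two layers, one frame, two dimensions, equal logits.
toy_layers = [[[0.0, 2.0]], [[4.0, 6.0]]]
toy_mixed = weighted_layer_sum(toy_layers, [0.0, 0.0])
```

With equal logits the mix reduces to a plain average of the layers; during fine-tuning only the logits and the downstream layers would receive gradient updates, while the pre-trained stack stays frozen.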

Model	FT	test-clean ( $\downarrow$ )	test-other ( $\downarrow$ )
original	all	7.1	15.3
sklearn	all	8.4	17.1
faiss	all	7.9	17.1
original	weighted	48.8	57.1
sklearn	weighted	50.5	59.0
faiss	weighted	50.2	59.2

Table 5: Average WER test results on the LibriSpeech dataset for different English 2nd iteration HuBERT models.

A.2 Hardware considerations

In this section, we detail the different hardware requirements for training our mHuBERT-147 model. Table 6 summarizes the key information. The next few paragraphs elaborate on these costs.

Task	RAM/job	CPU/job	GPU/job	Total Disk	Total Processing Time
Speech	-	-	-	9.6 TB	-
Speech Features iteration 1	-	-	-	4.7 TB	-
Speech Features iteration 2-3	50-500 GB	-	1x V100-32 GB	48 TB	3-5 days
faiss Index Training	2 TB	32x Intel(R) Xeon(R) Platinum 8452Y	-	3.6 GB	2 days
faiss Index Application	50-500 GB	4-8 cores (any)	-	37 GB	2-3 days
Manifest/Labels	-	-	-	65 GB	-
Model training (iteration)	2 TB/node	64 x AMD EPYC 7313 16-Core Processor	32x A100-80GB	35 GB	20 days

Table 6: Overview of the hardware requirements per job (RAM; CPU; GPU), total disk required to store the output of the tasks, and total estimated processing time. The numbers on the left illustrate the hierarchy of processes, with larger numbers depending on the completion of the previous processes.

Data storage requirements.

Storing mHuBERT-147’s 90 K hours of speech requires 9.6 TB of disk storage. For the first iteration, an additional 4.7 TB is required for storing the extracted MFCC features for clustering and label application. For the second and third iterations, substantially more storage is needed, as the 39-dimensional features are replaced by 768-dimensional vectors. As a result, our storage requirements increased roughly 10 times, reaching about 48 TB. Fortunately, storing the feature vectors is not a requirement for training, and so in our pre-processing protocol we performed iterative extraction for labeling, avoiding having more than 10 TB of feature vectors on our clusters at any given moment. We can therefore consider that the upper bound of storage required to train a mHuBERT-147 model with 90 K hours is around 58 TB (speech plus all feature vectors), and we estimate that training a model with the same amount of speech data as XLS-R could require up to 280 TB of storage space. While one can cut storage costs during pre-processing, this comes with an increase in pre-processing time. Alternating between extraction and labeling, followed by feature deletion, might also extend the GPU idle time between consecutive iterations.
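The storage figures above can be re-derived with a few lines of arithmetic. The sketch below assumes that storage scales linearly with the number of speech hours, and takes the 436 K hours of XLS-R from the abstract; the variable names are ours.

```python
# Figures reported in this section (TB) and in the abstract (hours).
SPEECH_TB = 9.6            # raw speech, 90 K hours
FEATURES_IT1_TB = 4.7      # 39-dimensional MFCC features (iteration 1)
FEATURES_IT23_TB = 48.0    # 768-dimensional features (iterations 2-3)
MHUBERT_HOURS = 90_000
XLSR_HOURS = 436_000

# Growth of the feature storage between iteration 1 and iterations 2-3.
feature_growth = FEATURES_IT23_TB / FEATURES_IT1_TB

# Upper bound: speech plus all iteration 2-3 feature vectors kept on disk.
upper_bound_tb = SPEECH_TB + FEATURES_IT23_TB

# Assumption: storage grows linearly with the amount of speech.
xlsr_scale_tb = upper_bound_tb * XLSR_HOURS / MHUBERT_HOURS

print(f"feature growth: x{feature_growth:.1f}")
print(f"upper bound: {upper_bound_tb:.1f} TB")
print(f"XLS-R-scale estimate: {xlsr_scale_tb:.0f} TB")
```

Under these assumptions the upper bound comes out at 57.6 TB and the XLS-R-scale estimate at just under 280 TB, matching the rounded values quoted in the text.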

GPU requirements for pre-processing.

GPUs are required to extract features for the entire corpus, as these are needed to produce the target labels. The cost of this extraction can be greatly reduced with data sharding. On average, extracting the feature vectors for 1 h of speech takes 9 s on a V100-32GB. Considering 12 parallel jobs running on V100-32GB GPUs, with optimum sharding 90 K hours of speech data could be processed in approximately 19 h. However, this is an over-optimistic lower bound, as sharding will never be optimal unless it is performed cross-dataset (mixing data from different datasets during the extraction process), which makes data management a more complex task. Moreover, this also requires the upper bound of storage requirements discussed above. For mHuBERT-147, we shard feature extraction only for datasets containing over 20,000 utterances for a given language, which allows us to parallelize the labeling task more effectively. This results in an approximate extraction time of 3 to 5 days for the entire mHuBERT-147 training corpus.
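The lower bound of roughly 19 h follows directly from the reported throughput. The helper below is a minimal sketch that assumes perfectly balanced shards and no I/O overhead, which is exactly why the resulting figure is optimistic; the function name `extraction_hours` is ours.

```python
def extraction_hours(speech_hours, seconds_per_speech_hour, parallel_jobs):
    """Ideal wall-clock time (in hours) with perfectly balanced shards."""
    total_seconds = speech_hours * seconds_per_speech_hour
    return total_seconds / parallel_jobs / 3600.0


# 90 K hours of speech, 9 s per hour of speech, 12 parallel jobs.
ideal_hours = extraction_hours(90_000, 9.0, 12)
print(f"ideal extraction time: {ideal_hours:.2f} h")
```

The same helper gives 225 h for a single job, which illustrates how much of the practical 3 to 5 day budget is explained by imperfect sharding rather than by raw GPU throughput.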

Clustering requirements.

In the original HuBERT paper, this step is performed by clustering with mini-batch k-means. In Section 3.2 we discuss replacing this approach with faiss indices, which are much more scalable. We sample training data using the same distribution used for up-sampling during multilingual training (Section 3.1), with a fixed RAM target of around 850 GB, which allows us to train indices on 2 TB RAM machines (the faiss index creation implementation requires almost twice the amount of RAM used for loading the data in order to train its indices). Index training using the “OPQ16_64,IVF1000_HNSW32,PQ16x4fsr” option takes approximately 40 h running in exclusive mode on a node with 32 Intel(R) Xeon(R) Platinum 8452Y and 2 TB RAM. Index application (i.e. labeling of speech feature vectors) takes 21 s per 10 h of speech transformed into numpy feature vectors, costing about 53.65 processing hours for the totality of the training data. Index application is a RAM-bound process in which loading the numpy features occupies the majority of the processing time; thus, with increased sharding and parallel processing, index application takes negligible processing time.
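Conceptually, index application maps every feature vector to the identifier of its closest centroid, and that identifier becomes the frame's target label. The sketch below shows this assignment with an exhaustive search in pure Python. It is only a stand-in for illustration: the faiss index described above replaces the exhaustive search with an accelerated (HNSW-based) centroid lookup, and the function name `assign_labels` is hypothetical.

```python
def assign_labels(frames, centroids):
    """Return, for each frame, the index of the nearest centroid (squared L2)."""
    labels = []
    for frame in frames:
        best_id, best_dist = -1, float("inf")
        for centroid_id, centroid in enumerate(centroids):
            dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
            if dist < best_dist:
                best_id, best_dist = centroid_id, dist
        labels.append(best_id)
    return labels


# Toy example with K=2 centroids in a 2-dimensional feature space.
toy_centroids = [[0.0, 0.0], [10.0, 10.0]]
toy_frames = [[1.0, 1.0], [9.0, 8.0], [0.0, -1.0]]
toy_labels = assign_labels(toy_frames, toy_centroids)
```

The exhaustive version costs one distance computation per frame and centroid, which is the part that becomes expensive at the scale of 90 K hours of speech and motivates an approximate search structure.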

Training requirements.

Training was performed using the NAVER Cloud platform (https://www.ncloud.com/). We use 32x A100-80 GB GPUs running on 4 parallel nodes. Regarding RAM requirements, considering 8 parallel data loaders, manifest lists and label offsets stored in memory, 1.5 TB of RAM is needed. While keeping labels directly in RAM would increase batching speed, this would require an additional 2 TB of RAM. Storage needs are limited to the speech data (9.6 TB), corresponding manifest and label files (65 GB) and checkpoints (1.5 GB per checkpoint).

A.3 Loss curves and training instability

The loss curves for the three mHuBERT-147 iterations are presented in Figure 2. We observe that while the loss decrease from the 1st to the 2nd iteration is very pronounced, it seems to be more limited between the 2nd and the 3rd iterations. We hypothesize that the model might be starting to saturate at the end of the 3rd iteration, and might need extra capacity for improving further.

Unlike in our previous experience training wav2vec 2.0 models, we find that our mHuBERT-147 models tend to be more stable during training. We do, however, reach training instability at the end of the 3rd iteration. To address gradient explosion, similarly to wav2vec 2.0 (see discussion in Parcollet et al. [37]), we find that the most effective solution is increasing floating point precision and restarting the optimizer from the previous learning rate.

A.4 Pre-training hyper-parameters

In this section, we cover the different hyper-parameters chosen for the training of mHuBERT-147, comparing intermediate checkpoints obtained using different settings. As a proxy evaluation of the quality of the speech representations obtained, we use the 1 h phoneme recognition task from Rivière et al. [44]. This setting allows us to investigate the emergence of multilingual phonemic content on our models through different iterations and training updates.

Proxy evaluation.

We train monolingual few-shot phoneme recognition systems using the 10 available languages: Spanish (es), French (fr), Italian (it), Kyrgyz (ky), Dutch (nl), Russian (ru), Swedish (sv), Turkish (tr), Tatar (tt), Chinese (zh-HK). We follow the standard data split, using 1 h for training, validation and test. Our models are trained using the Speechbrain library [42], with a standard CTC fine-tuning approach employing two different learning rates: $10^{-5}$ for the pre-trained SSL model, and $10^{-4}$ for the preceding 2 linear layers and output projection. We use a dropout of 0.6 and train for 100 epochs (recipe at: https://github.com/utter-project/mHuBERT-147-scripts/tree/main/06_evaluation/CV_PR).
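For readers unfamiliar with the metric, the phoneme error rate (PER) reported for this proxy task is the usual edit-distance-based error rate computed over phoneme sequences. The sketch below is a minimal reference implementation written for illustration; it is not the Speechbrain scoring code used in our recipe, and the function names are ours.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(hyp) + 1))
    for i, ref_sym in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_sym in enumerate(hyp, start=1):
            cost = 0 if ref_sym == hyp_sym else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Corpus-level PER in percent: total edits over total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return 100.0 * errors / total


# Toy example: one substitution and one deletion over 7 reference phonemes.
toy_refs = [["k", "a", "t"], ["h", "e", "l", "o"]]
toy_hyps = [["k", "a", "p"], ["h", "l", "o"]]
toy_per = phoneme_error_rate(toy_refs, toy_hyps)
```

Lower is better: a PER of 14.8 means that, on average, about 15 edits are needed per 100 reference phonemes.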

Model	Average PER ( $\downarrow$ )	es	fr	it	ky	nl	ru	sv	tr	tt	zh-HK
sklearn ( $K=500$ )	14.8	10.8	13.7	15.0	11.5	15.8	16.6	19.2	13.2	9.5	22.3
faiss ( $K=1000$ )	14.6	10.7	13.6	14.8	10.9	15.5	16.4	19.3	13.2	9.4	22.5
random crop	14.4	10.4	12.9	14.1	10.9	14.5	16.2	19.3	13.1	9.4	22.8
padding	23.6	18.0	23.7	22.9	20.1	25.4	25.0	29.7	23.0	17.1	30.9

Table 7: PER scores for the proxy task. Models on top and bottom portions are not comparable due to different up-sampling settings and maximum number of updates (top: 800 K, bottom: 2 M). Top portion: 1st iteration mHuBERT-147 models ( $\alpha=0.5$ ; $\beta=0.8$ ) trained for 800 K updates using different clustering implementations: sklearn mini-batch k-means ( $K=500$ ), or faiss IVF ( $K=1000$ ). Bottom portion: 1st iteration faiss-based mHuBERT-147 models ( $\alpha=0.7$ ; $\beta=0.9$ ) trained for 400 K updates using random crop or padding as batching approach.

A.4.1 Impact of number of updates and iterations

We train each iteration for 2 M updates, which is considerably more than the standard approach (between 100 K and 500 K updates [23, 28]). In this section, we present evidence that pre-training longer is necessary for improved multilingual representation learning.

Figure 3 presents mHuBERT-147 proxy performance across different iterations, and on varying number of updates. (Linear interpolation is used to build our plot, as evaluation across updates did not always fetch checkpoints corresponding to the exact same number of updates. Nonetheless, we highlight in bold the points at which checkpoints were evaluated across all iterations.) Results are average PER scores computed across ten languages. We highlight performance at 400 K, 1 M, 1.5 M and 2 M.
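The interpolation used for the plot can be sketched as follows. The checkpoint values in the example are hypothetical placeholders chosen only to illustrate the computation; they are not measured scores, and the function name `interpolate_per` is ours.

```python
def interpolate_per(checkpoints, update):
    """Linearly interpolate PER at a given number of updates.

    checkpoints: list of (num_updates, per) pairs sorted by num_updates.
    """
    for (u0, p0), (u1, p1) in zip(checkpoints, checkpoints[1:]):
        if u0 <= update <= u1:
            frac = (update - u0) / (u1 - u0)
            return p0 + frac * (p1 - p0)
    raise ValueError("update lies outside the evaluated range")


# Hypothetical checkpoints (updates, PER); NOT values from our experiments.
toy_checkpoints = [(350_000, 12.0), (450_000, 11.0), (1_000_000, 9.0)]
per_at_400k = interpolate_per(toy_checkpoints, 400_000)
```

In this toy case the value at 400 K updates is the midpoint between the two neighbouring checkpoints, which allows curves from different iterations to be compared at a common number of updates.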

Despite the loss curve looking stable after 400 K (Figure 2), we observe that the models keep improving internal multilingual phonemic representation until the very end of the 2 M updates. This is true for all iterations but the 3rd, where we achieve for the first time a best score that does not correspond to the last checkpoint: 8.1 PER at update 1.975 M. We believe this is probably an artifact of the multilingual batching not favoring the specific subset of languages present in the proxy evaluation during the very last training batches. This can also be a result linked to model saturation.

Focusing on proxy results for the 1st and 2nd iterations, we observe that after 400 K updates, the 2nd iteration model produces scores similar to that at the end of the 1st iteration. A longer training allows this model to yield an overall performance improvement of 21.6% over the best score for the 1st iteration. Regarding the 3rd iteration, we observe that the proxy task improvement is not as significant compared to the 2nd iteration: the 3rd iteration model only surpasses the best 2nd iteration scores after 1.5 M updates. This hints at the saturation of the multi-iteration HuBERT training approach at the 3rd step. We believe that more capacity (i.e. going from base to large architecture) could allow the model to improve more substantially at this point.

Finally, we note that although this proxy task allows us to assess the model’s potential during training, its scores are not completely correlated to the final model’s downstream performance, mostly due to the limitation in covered languages (only 10). For instance, although the 3rd iteration seems to improve only 0.5 PER over the second iteration, we observe significant improvement of ML-SUPERB scores when using the 3rd over the 2nd iteration model (Section 4).

A.4.2 K-means versus faiss on multilingual settings

We train two 1st iteration mHuBERT-147 models for 800 K updates, varying the K-means implementation between sklearn ( $K=500$ ) and faiss ( $K=1000$ ). We select $K=500$ for sklearn because, as mentioned in Section 3.2.1, we were unable to train sklearn K-means models for our dataset using larger K values. Comparing models with different K values could result in a more favorable setting for faiss, but we believe this comparison can still be made using 1st iteration models. This is because at this step, clustering is performed using MFCCs, which are low-dimensional. Thus, the extra expressivity that faiss-based clustering has at this iteration should not have a significant impact on the quality of the produced discretization.

The top portion of Table 7 presents the results using our proxy evaluation setting for the two pre-trained 1st iteration models.¹⁹ We once again observe that faiss-based mHuBERT-147 training does not result in inferior speech representation modeling.

¹⁹ We highlight that these scores are not comparable to the ones in Section A.4.1, as for this experiment, the up-sampling is more aggressive ( $\alpha=0.5$ ; $\beta=0.8$ ).

A.4.3 Padding versus random crop

We investigate the impact of random crop during multilingual training, compared to the more classical approach of padding. We train a 1st iteration padded mHuBERT-147 model for 400 K updates, and compare it to the model using random crop at the same training state (same number of seen examples). The bottom portion of Table 7 presents results for our proxy task. We observe that, at this low number of updates (400 K), the model trained with random crop batching substantially outperforms the one trained with padding. We believe this is due to the higher example diversity in the random crop setting.

A.4.4 Two-level language-source up-sampling

The hyper-parameters for the two-level language-source up-sampling were chosen considering the language-source distribution in our data collection. The following are our observations regarding these hyper-parameters.

Language up-sampling.

Aggressive language up-sampling will result in a better multilingual balance in terms of language exposure, but it can result in poor utterance diversity during a given epoch. This issue comes from the fact that low-resource languages have a limited number of examples, and thus aggressive up-sampling will allocate a large chunk of the batches to the same examples. While repeating examples in low-resource languages is the goal of up-sampling, we find that over-exposure to the same limited number of examples will result in overall multilingual speech representation degradation. For this reason, during our hyper-parameter search, we moved from $\alpha=0.5$ (41% of a given epoch allocated to repeated examples) to $\alpha=0.7$ (20% of a given epoch allocated to repeated examples).

Source up-sampling.

We find that this parameter is more important for high-resource languages, as low-resource languages are often single-sourced in our multilingual dataset. As we filtered overly large datasets for high-resource languages prior to training (Section 2), we employ a mild source up-sampling.
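The effect of the two exponents can be illustrated with a minimal sketch. It assumes the common exponentiated form, in which a language is drawn with probability proportional to $(n_l/N)^{\alpha}$ and a source within that language with probability proportional to $(n_{l,s}/n_l)^{\beta}$. This form, the function name `two_level_sampling_probs`, and the three-entry `hours` dictionary (values copied from Table 17) are assumptions made for illustration; they are not taken from the training code.

```python
from collections import defaultdict


def two_level_sampling_probs(hours, alpha, beta):
    """Return a sampling probability per (language, source) pair.

    hours: dict mapping (language, source) -> hours of speech.
    alpha: language-level exponent (smaller = more aggressive up-sampling).
    beta:  source-level exponent, applied within each language.
    Assumed form: p(l) ~ (n_l / N)^alpha and p(s | l) ~ (n_ls / n_l)^beta.
    """
    lang_hours = defaultdict(float)
    for (lang, _source), h in hours.items():
        lang_hours[lang] += h
    total = sum(lang_hours.values())

    # Level 1: language probabilities.
    lang_w = {lang: (h / total) ** alpha for lang, h in lang_hours.items()}
    lang_z = sum(lang_w.values())

    probs = {}
    for lang, n_l in lang_hours.items():
        # Level 2: source probabilities inside this language.
        src_w = {
            src: (h / n_l) ** beta
            for (l, src), h in hours.items()
            if l == lang
        }
        src_z = sum(src_w.values())
        for src, w in src_w.items():
            probs[(lang, src)] = (lang_w[lang] / lang_z) * (w / src_z)
    return probs


# Toy subset of Table 17 (hours), for illustration only.
hours = {("eng", "MLS"): 44659.7, ("eng", "CV"): 2213.9, ("ceb", "VL"): 3.9}

p_natural = two_level_sampling_probs(hours, alpha=1.0, beta=1.0)
p_upsampled = two_level_sampling_probs(hours, alpha=0.5, beta=0.8)
for pair in hours:
    print(pair, f"{p_natural[pair]:.5f}", f"{p_upsampled[pair]:.5f}")
```

With exponents of 1.0 the probabilities follow the natural data distribution; lowering $\alpha$ shifts probability mass towards the low-resource language, which is the source of the repeated-example effect discussed above.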

A.5 Open-source links

The mHuBERT-147 models are made available under the CC-BY-NC-4.0 license. Our model release includes the faiss-trained IVFs, the fairseq checkpoint and the corresponding HuggingFace checkpoint.

• mHuBERT-147 (-3rd): https://huggingface.co/utter-project/mHuBERT-147
• mHuBERT-147-2nd: https://huggingface.co/utter-project/mHuBERT-147-base-2nd-iter
• mHuBERT-147-1st: https://huggingface.co/utter-project/mHuBERT-147-base-1st-iter
• Manifest files: https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest
• Training scripts (fairseq fork): https://github.com/utter-project/fairseq
• Pre-processing and evaluation scripts (including faiss scripts): https://github.com/utter-project/mHuBERT-147-scripts

Appendix B Downstream experiments

B.1 ML-SUPERB experiments

For evaluation on ML-SUPERB, we follow the recipe provided in the ESPnet library [58]. We use their Hugging Face interface for fine-tuning our mHuBERT-147 model.²⁰ We also implement an ML-SUPERB score calculator, since we could not find one provided by the organizers.²¹

²⁰ More information available at: https://github.com/espnet/espnet/tree/master/egs2/ml_superb/asr1
²¹ Available at: https://github.com/utter-project/mHuBERT-147-scripts/blob/main/06_evaluation/mlsuperb/compute_superb_score.py

Learning Rate.

Following the official ML-SUPERB guidelines, we only optimize the learning rate parameter (best of 1e-3, 1e-4 and 1e-5). Table 8 presents the list of learning rates selected by evaluating models on the validation set. The scores presented in Table 3 correspond to models trained with the corresponding learning rate.
Our experiments showed that evaluating models with multiple learning rates is important. We show test LID results in Table 9 for mHuBERT-147-3rd trained using different learning rates. By varying the learning rate, we go from mediocre LID results to SOTA scores for this ML-SUPERB task.

Monolingual ASR Task.

Table 10 presents test CER results per language for mHuBERT-147-3rd, in both 10 min and 1 h settings. For each language, we train three models, varying the learning rate. We use the development set to select the final model for our submission (see Table 8). During training, we observed many instabilities for this particular setting, with models sometimes failing to converge. In these cases, we find that restarting training with a new random seed fixes the problem. We highlight that seeds vary across languages.

	Monolingual ASR	Multilingual ASR	LID	Multilingual ASR+LID
10 min	1e-3 (GROUP A)	1e-3	1e-4	1e-3
	1e-4 (GROUP B)

1 h

1e-3

1e-4

Table 8: Selected learning rates for mHuBERT-147-3rd using validation scores for the corresponding tasks. For monolingual ASR, GROUP A corresponds to: eng1, eng2, eng3, deu1, rus, swa, jpn, cmn, and GROUP B corresponds to: fra1, fra2, deu2, swe, xty.

		ACC ( $\uparrow$ )
10 min	1e-3	49.68
	1e-4	85.33
	1e-5	69.56
1 h	1e-3	51.06
	1e-4	90.95
	1e-5	81.35

Table 9: ML-SUPERB 10 min and 1 h LID ACC test scores for mHuBERT-147-3rd by varying the learning rate for training.

	Final Score	eng1	eng2	eng3	fra1	fra2	deu1	deu2	rus	swa	swe	jpn	cmn	xty
10 min	34.2	30.9	42.3	37.0	43.2	51.0	25.2	36.9	26.4	24.7	29.9	13.9	34.7	63.0
1 h	26.3	22.6	33.6	27.3	29.5	36.1	20.2	24.3	20.5	19.8	21.6	10.4	23.8	57.9

Table 10: Detailed mHuBERT-147-3rd test CER ( $\downarrow$ ) results for the languages inside the ML-SUPERB monolingual track. The final score is obtained by averaging all runs belonging to the same language, and then averaging this score across all different languages.
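The two-level averaging described in the caption of Table 10 can be checked with a short sketch. The per-run CER values are copied from Table 10; the helper name `monolingual_final_score` and the convention of stripping the trailing digit to recover the language (e.g. `eng1` to `eng`) are assumptions made for this illustration.

```python
from statistics import mean

# Per-run test CER values copied from Table 10.
cer_10min = {
    "eng1": 30.9, "eng2": 42.3, "eng3": 37.0, "fra1": 43.2, "fra2": 51.0,
    "deu1": 25.2, "deu2": 36.9, "rus": 26.4, "swa": 24.7, "swe": 29.9,
    "jpn": 13.9, "cmn": 34.7, "xty": 63.0,
}
cer_1h = {
    "eng1": 22.6, "eng2": 33.6, "eng3": 27.3, "fra1": 29.5, "fra2": 36.1,
    "deu1": 20.2, "deu2": 24.3, "rus": 20.5, "swa": 19.8, "swe": 21.6,
    "jpn": 10.4, "cmn": 23.8, "xty": 57.9,
}


def monolingual_final_score(cer_by_run):
    """Average runs within each language, then average across languages."""
    by_language = {}
    for run, cer in cer_by_run.items():
        language = run.rstrip("0123456789")  # "eng1" -> "eng", "rus" -> "rus"
        by_language.setdefault(language, []).append(cer)
    return mean(mean(values) for values in by_language.values())


print(round(monolingual_final_score(cer_10min), 1))  # 34.2
print(round(monolingual_final_score(cer_1h), 1))     # 26.3
```

Averaging within a language first prevents languages with several runs (English, French, German) from weighing more than single-run languages in the final score.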

B.2 FLEURS-102 experiments

Table 11 presents average CER results for all languages in FLEURS-102.

Failed ASR models.

We find that MMS and XLS-R models fail to converge to useful transcriptions (CER $>90$ ) for 16 languages (ara, ceb, ckb, cym, fas, ful, heb, hrv, hye, mri, nya, oci, snd, som, tam, tel). For mHuBERT-147, this only happens for half of these languages (cym, fas, hye, nya, oci, snd, som, tel). There is no instance where mHuBERT-147 fails to converge while the other models converge successfully.
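As a minimal sketch of how such failures can be flagged, the snippet below applies the CER $>90$ criterion to the mean CER of four languages copied from Table 11. Applying the threshold to the mean over runs, as well as the names `FAIL_THRESHOLD` and `failed_models`, are assumptions made for this illustration.

```python
FAIL_THRESHOLD = 90.0  # CER above this value is treated as a failed ASR model

# Mean CER per model for four languages, copied from Table 11.
mean_cer = {
    "afr": {"MMS-1B": 9.5, "MMS-300M": 12.5, "XLS-R-300M": 12.0, "mHuBERT-147": 15.3},
    "ara": {"MMS-1B": 99.1, "MMS-300M": 98.7, "XLS-R-300M": 99.1, "mHuBERT-147": 9.8},
    "cym": {"MMS-1B": 99.3, "MMS-300M": 99.3, "XLS-R-300M": 99.3, "mHuBERT-147": 99.3},
    "tam": {"MMS-1B": 99.3, "MMS-300M": 98.0, "XLS-R-300M": 92.9, "mHuBERT-147": 16.3},
}


def failed_models(scores, threshold=FAIL_THRESHOLD):
    """Return the sorted names of models whose mean CER exceeds the threshold."""
    return sorted(model for model, cer in scores.items() if cer > threshold)


for language, scores in mean_cer.items():
    print(language, failed_models(scores))
```

In this subset, Arabic and Tamil are cases where only the baselines fail, Welsh is a case where all four models fail, and Afrikaans is a case where none do.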

Seen versus unseen languages.

We use the model closest to ours in language coverage (XLS-R) to compare model performance for seen and unseen languages. Table 12 presents the results. We observe that while being much more compact, mHuBERT-147 presents overall better performance in both seen and unseen languages.

	afr	amh	ara	asm	ast	aze	bel	ben	bos	bul	cat	ceb
MMS-1B	9.5 $\pm$ 0.02	7.9 $\pm$ 0.13	99.1 $\pm$ 0.0	9.8 $\pm$ 0.02	5.7 $\pm$ 0.13	6.0 $\pm$ 0.01	4.4 $\pm$ 0.1	7.0 $\pm$ 0.07	5.0 $\pm$ 0.03	4.2 $\pm$ 0.02	4.5 $\pm$ 0.01	86.8 $\pm$ 6.43
MMS-300M	12.5 $\pm$ 0.09	9.4 $\pm$ 0.13	98.7 $\pm$ 0.37	12.4 $\pm$ 0.08	7.0 $\pm$ 0.01	7.8 $\pm$ 0.03	5.8 $\pm$ 0.0	8.0 $\pm$ 0.1	6.5 $\pm$ 0.08	5.6 $\pm$ 0.05	6.1 $\pm$ 0.08	99.3 $\pm$ 0.0
XLS-R-300M	12.0 $\pm$ 0.73	10.6 $\pm$ 0.31	99.1 $\pm$ 0.0	12.6 $\pm$ 0.07	6.4 $\pm$ 0.02	8.0 $\pm$ 0.15	5.6 $\pm$ 0.04	8.4 $\pm$ 0.0	5.7 $\pm$ 0.09	4.8 $\pm$ 0.18	5.2 $\pm$ 0.04	99.3 $\pm$ 0.0
mHuBERT-147	15.3 $\pm$ 0.74	12.1 $\pm$ 0.21	9.8 $\pm$ 0.16	15.1 $\pm$ 0.18	10.2 $\pm$ 0.24	11.3 $\pm$ 0.33	8.0 $\pm$ 0.11	9.6 $\pm$ 0.13	9.3 $\pm$ 0.13	8.5 $\pm$ 0.05	8.4 $\pm$ 0.02	57.0 $\pm$ 43.03

	ces	ckb	cmn	cym	dan	deu	ell	eng	est	fas	fil	fin
MMS-1B	4.6 $\pm$ 0.07	97.4 $\pm$ 1.82	17.9 $\pm$ 0.58	99.3 $\pm$ 0.0	9.1 $\pm$ 0.11	3.9 $\pm$ 0.02	6.1 $\pm$ 0.35	6.4 $\pm$ 0.26	3.5 $\pm$ 0.04	99.6 $\pm$ 0.0	4.4 $\pm$ 0.11	2.9 $\pm$ 0.06
MMS-300M	6.8 $\pm$ 0.04	98.0 $\pm$ 1.24	30.1 $\pm$ 0.07	99.3 $\pm$ 0.0	12.4 $\pm$ 0.02	6.0 $\pm$ 0.1	8.3 $\pm$ 0.2	8.8 $\pm$ 0.4	4.4 $\pm$ 0.15	99.6 $\pm$ 0.0	5.4 $\pm$ 0.14	3.7 $\pm$ 0.02
XLS-R-300M	5.7 $\pm$ 0.08	94.9 $\pm$ 4.32	27.9 $\pm$ 1.63	99.3 $\pm$ 0.0	11.1 $\pm$ 0.22	5.0 $\pm$ 0.04	7.5 $\pm$ 0.24	7.8 $\pm$ 0.11	3.9 $\pm$ 0.16	96.0 $\pm$ 3.59	5.4 $\pm$ 0.13	3.3 $\pm$ 0.04
mHuBERT-147	11.1 $\pm$ 0.1	58.5 $\pm$ 41.52	30.7 $\pm$ 1.15	99.3 $\pm$ 0.0	17.6 $\pm$ 0.09	9.6 $\pm$ 0.14	13.5 $\pm$ 0.04	11.0 $\pm$ 0.14	6.4 $\pm$ 0.04	100.0 $\pm$ 0.0	7.2 $\pm$ 0.2	5.4 $\pm$ 0.01

	fra	ful	gle	glg	guj	hau	heb	hin	hrv	hun	hye	ibo
MMS-1B	7.0 $\pm$ 0.1	98.4 $\pm$ 0.86	24.3 $\pm$ 0.1	3.8 $\pm$ 0.09	6.7 $\pm$ 0.24	8.2 $\pm$ 0.06	99.5 $\pm$ 0.12	7.2 $\pm$ 0.08	99.4 $\pm$ 0.0	5.8 $\pm$ 0.02	99.3 $\pm$ 0.0	13.1 $\pm$ 0.02
MMS-300M	10.5 $\pm$ 0.16	103.8 $\pm$ 4.62	29.2 $\pm$ 0.08	5.1 $\pm$ 0.0	8.0 $\pm$ 0.02	9.2 $\pm$ 0.01	109.4 $\pm$ 10.43	9.3 $\pm$ 0.06	99.4 $\pm$ 0.0	9.2 $\pm$ 0.09	98.6 $\pm$ 0.67	14.7 $\pm$ 0.11
XLS-R-300M	9.0 $\pm$ 0.21	94.8 $\pm$ 3.23	29.1 $\pm$ 0.34	4.6 $\pm$ 0.03	8.5 $\pm$ 0.12	10.1 $\pm$ 0.11	107.5 $\pm$ 0.6	10.3 $\pm$ 0.15	95.8 $\pm$ 3.62	8.7 $\pm$ 0.22	96.2 $\pm$ 3.13	15.0 $\pm$ 0.2
mHuBERT-147	16.1 $\pm$ 0.08	19.1 $\pm$ 0.03	32.7 $\pm$ 0.08	8.3 $\pm$ 0.05	10.4 $\pm$ 0.16	12.0 $\pm$ 0.19	22.5 $\pm$ 0.02	12.5 $\pm$ 0.25	8.5 $\pm$ 0.05	14.1 $\pm$ 0.23	100.0 $\pm$ 0.0	17.1 $\pm$ 0.03

	ind	isl	ita	jav	jpn	kam	kan	kat	kaz	kea	khm	kir
MMS-1B	3.9 $\pm$ 0.09	9.4 $\pm$ 0.01	2.1 $\pm$ 0.01	5.4 $\pm$ 0.23	22.2 $\pm$ 0.26	12.5 $\pm$ 0.06	5.8 $\pm$ 0.01	5.6 $\pm$ 0.06	3.3 $\pm$ 0.08	4.9 $\pm$ 0.02	16.4 $\pm$ 0.07	4.6 $\pm$ 0.14
MMS-300M	4.8 $\pm$ 0.1	15.5 $\pm$ 0.07	2.8 $\pm$ 0.04	6.8 $\pm$ 0.15	33.3 $\pm$ 0.29	13.8 $\pm$ 0.39	7.3 $\pm$ 0.12	8.0 $\pm$ 0.08	4.1 $\pm$ 0.01	5.8 $\pm$ 0.03	22.0 $\pm$ 0.14	5.8 $\pm$ 0.15
XLS-R-300M	4.8 $\pm$ 0.21	16.0 $\pm$ 0.25	2.4 $\pm$ 0.01	7.1 $\pm$ 0.19	34.9 $\pm$ 0.17	15.0 $\pm$ 0.15	7.8 $\pm$ 0.04	7.9 $\pm$ 0.03	4.0 $\pm$ 0.16	5.3 $\pm$ 0.03	22.5 $\pm$ 0.85	6.1 $\pm$ 0.05
mHuBERT-147	6.8 $\pm$ 0.15	14.9 $\pm$ 0.27	4.8 $\pm$ 0.06	9.6 $\pm$ 0.09	38.4 $\pm$ 0.73	16.3 $\pm$ 0.08	8.5 $\pm$ 0.12	10.3 $\pm$ 0.19	5.6 $\pm$ 0.17	7.8 $\pm$ 0.01	24.5 $\pm$ 0.11	7.5 $\pm$ 0.08

	kor	lao	lav	lin	lit	ltz	lug	luo	mal	mar	mkd	mlt
MMS-1B	15.0 $\pm$ 0.14	24.9 $\pm$ 0.67	3.4 $\pm$ 0.13	5.1 $\pm$ 0.09	5.0 $\pm$ 0.01	8.9 $\pm$ 0.22	9.5 $\pm$ 0.14	6.3 $\pm$ 0.05	5.3 $\pm$ 0.19	8.7 $\pm$ 0.14	2.6 $\pm$ 0.01	4.9 $\pm$ 0.06
MMS-300M	21.7 $\pm$ 0.5	29.5 $\pm$ 0.22	4.7 $\pm$ 0.09	5.9 $\pm$ 0.12	7.1 $\pm$ 0.03	11.1 $\pm$ 0.11	10.2 $\pm$ 0.01	7.3 $\pm$ 0.07	6.3 $\pm$ 0.05	11.2 $\pm$ 0.07	3.4 $\pm$ 0.08	6.2 $\pm$ 0.04
XLS-R-300M	24.6 $\pm$ 1.37	30.4 $\pm$ 0.15	4.1 $\pm$ 0.08	6.0 $\pm$ 0.13	6.1 $\pm$ 0.01	10.7 $\pm$ 0.01	10.6 $\pm$ 0.02	8.1 $\pm$ 0.12	6.6 $\pm$ 0.14	12.2 $\pm$ 0.02	3.2 $\pm$ 0.02	5.4 $\pm$ 0.01
mHuBERT-147	20.7 $\pm$ 0.22	33.5 $\pm$ 0.17	7.6 $\pm$ 0.1	7.6 $\pm$ 0.06	10.2 $\pm$ 0.07	14.9 $\pm$ 0.03	11.8 $\pm$ 0.15	9.3 $\pm$ 0.22	7.6 $\pm$ 0.05	14.1 $\pm$ 0.15	5.1 $\pm$ 0.18	8.7 $\pm$ 0.13

	mon	mri	msa	mya	nep	nld	nob	nso	nya	oci	ori	orm
MMS-1B	8.6 $\pm$ 0.23	99.3 $\pm$ 0.0	4.6 $\pm$ 0.02	15.7 $\pm$ 0.48	10.0 $\pm$ 0.25	5.3 $\pm$ 0.12	5.2 $\pm$ 0.04	8.0 $\pm$ 0.13	99.3 $\pm$ 0.0	96.3 $\pm$ 2.79	11.3 $\pm$ 0.12	16.8 $\pm$ 0.37
MMS-300M	11.2 $\pm$ 0.04	99.7 $\pm$ 0.35	6.7 $\pm$ 0.05	20.6 $\pm$ 0.2	12.4 $\pm$ 0.34	8.0 $\pm$ 0.16	6.5 $\pm$ 0.05	9.7 $\pm$ 0.11	99.3 $\pm$ 0.0	99.4 $\pm$ 0.0	16.2 $\pm$ 0.09	19.6 $\pm$ 0.17
XLS-R-300M	11.0 $\pm$ 0.32	96.7 $\pm$ 0.24	7.0 $\pm$ 0.09	20.2 $\pm$ 0.11	13.2 $\pm$ 0.47	6.6 $\pm$ 0.14	6.0 $\pm$ 0.13	10.3 $\pm$ 0.21	99.3 $\pm$ 0.0	99.4 $\pm$ 0.0	18.4 $\pm$ 0.05	20.0 $\pm$ 0.47
mHuBERT-147	14.0 $\pm$ 0.43	11.5 $\pm$ 0.01	9.9 $\pm$ 0.16	22.3 $\pm$ 0.12	15.4 $\pm$ 0.19	12.2 $\pm$ 0.26	9.5 $\pm$ 0.16	12.5 $\pm$ 0.07	99.3 $\pm$ 0.0	99.4 $\pm$ 0.0	19.6 $\pm$ 0.1	21.1 $\pm$ 0.3

	pan	pol	por	pus	ron	rus	slk	slv	sna	snd	som	spa
MMS-1B	9.1 $\pm$ 0.03	4.2 $\pm$ 0.01	4.4 $\pm$ 0.08	18.3 $\pm$ 0.12	5.0 $\pm$ 0.03	4.8 $\pm$ 0.01	3.5 $\pm$ 0.07	6.2 $\pm$ 0.0	5.0 $\pm$ 0.08	99.1 $\pm$ 0.0	99.3 $\pm$ 0.0	2.9 $\pm$ 0.08
MMS-300M	11.4 $\pm$ 0.14	6.0 $\pm$ 0.01	5.8 $\pm$ 0.02	21.0 $\pm$ 0.23	7.2 $\pm$ 0.02	6.5 $\pm$ 0.07	4.7 $\pm$ 0.1	8.2 $\pm$ 0.19	5.9 $\pm$ 0.02	96.1 $\pm$ 3.05	97.7 $\pm$ 1.69	3.8 $\pm$ 0.03
XLS-R-300M	12.4 $\pm$ 0.15	5.0 $\pm$ 0.05	4.9 $\pm$ 0.04	21.5 $\pm$ 0.51	6.3 $\pm$ 0.05	6.0 $\pm$ 0.15	4.0 $\pm$ 0.07	7.2 $\pm$ 0.11	6.5 $\pm$ 0.03	99.1 $\pm$ 0.0	96.7 $\pm$ 2.66	3.3 $\pm$ 0.05
mHuBERT-147	14.5 $\pm$ 0.07	10.2 $\pm$ 0.18	9.0 $\pm$ 0.26	24.0 $\pm$ 0.29	10.9 $\pm$ 0.0	9.2 $\pm$ 0.08	7.6 $\pm$ 0.07	11.1 $\pm$ 0.02	8.5 $\pm$ 0.19	99.1 $\pm$ 0.0	100.0 $\pm$ 0.0	6.1 $\pm$ 0.08

	srp	swa	swe	tam	tel	tgk	tha	tur	ukr	umb	urd	uzb
MMS-1B	14.2 $\pm$ 0.11	4.2 $\pm$ 0.05	7.0 $\pm$ 0.23	99.3 $\pm$ 0.0	99.3 $\pm$ 0.0	5.1 $\pm$ 0.04	12.6 $\pm$ 0.11	4.3 $\pm$ 0.16	4.9 $\pm$ 0.01	16.6 $\pm$ 0.27	8.8 $\pm$ 0.11	7.7 $\pm$ 0.04
MMS-300M	16.9 $\pm$ 0.17	5.5 $\pm$ 0.06	10.0 $\pm$ 0.06	98.0 $\pm$ 1.32	99.3 $\pm$ 0.0	5.9 $\pm$ 0.05	16.1 $\pm$ 0.17	5.8 $\pm$ 0.09	6.8 $\pm$ 0.0	18.9 $\pm$ 0.41	10.9 $\pm$ 0.05	10.3 $\pm$ 0.0
XLS-R-300M	16.5 $\pm$ 0.66	5.6 $\pm$ 0.1	9.1 $\pm$ 0.01	92.9 $\pm$ 6.42	99.3 $\pm$ 0.0	6.2 $\pm$ 0.18	16.0 $\pm$ 0.19	5.8 $\pm$ 0.09	6.1 $\pm$ 0.08	20.0 $\pm$ 0.64	13.6 $\pm$ 0.15	10.7 $\pm$ 0.12
mHuBERT-147	18.8 $\pm$ 0.15	6.9 $\pm$ 0.04	15.2 $\pm$ 0.11	16.3 $\pm$ 0.22	99.3 $\pm$ 0.0	7.0 $\pm$ 0.08	17.7 $\pm$ 0.12	7.4 $\pm$ 0.07	9.6 $\pm$ 0.0	23.4 $\pm$ 0.12	15.1 $\pm$ 0.36	13.8 $\pm$ 0.27

	vie	wol	xho	yor	yue	zul
MMS-1B	10.9 $\pm$ 0.11	14.7 $\pm$ 0.1	6.7 $\pm$ 0.09	18.2 $\pm$ 0.22	16.1 $\pm$ 0.01	6.6 $\pm$ 0.03
MMS-300M	14.0 $\pm$ 0.36	16.1 $\pm$ 0.29	7.6 $\pm$ 0.02	21.1 $\pm$ 0.17	58.1 $\pm$ 19.39	8.3 $\pm$ 0.15
XLS-R-300M	15.6 $\pm$ 0.31	16.2 $\pm$ 0.4	8.0 $\pm$ 0.07	22.1 $\pm$ 0.5	46.3 $\pm$ 5.03	8.5 $\pm$ 0.0
mHuBERT-147	17.3 $\pm$ 0.14	17.8 $\pm$ 0.01	9.8 $\pm$ 0.05	23.9 $\pm$ 0.08	36.1 $\pm$ 1.55	11.1 $\pm$ 0.01

Table 11: FLEURS-102 CER ( $\downarrow$ ) average results per language with standard deviations. Best scores in bold.

	General Avg (102)	Seen (87)	Unseen (15)
XLS-R-300M	24.5	23.8	28.3
mHuBERT-147	21.1	20.8	22.6

Table 12: FLEURS-102 CER ( $\downarrow$ ) average scores for seen and unseen languages.
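The three columns of Table 12 are tied together by a language-count weighted average, which the sketch below verifies. It assumes that the general average is the unweighted mean over all 102 languages, so that it equals the combination of the seen (87) and unseen (15) averages weighted by their language counts; the helper name `weighted_average` is hypothetical, and small deviations are expected because the table values are rounded.

```python
def weighted_average(groups):
    """Combine (count, average) pairs into a single count-weighted average."""
    total = sum(count for count, _ in groups)
    return sum(count * average for count, average in groups) / total


# (number of languages, average CER) for seen and unseen languages, from Table 12.
xlsr = [(87, 23.8), (15, 28.3)]
mhubert = [(87, 20.8), (15, 22.6)]

print(round(weighted_average(xlsr), 1))     # 24.5
print(round(weighted_average(mhubert), 1))  # 21.1
```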

B.3 Monolingual versus multilingual HuBERT

In this section we compare the English speech representation from our best mHuBERT-147 model against a model of the same capacity, but trained on English-only data [23].²² With these experiments, we want to illustrate that mHuBERT-147 is not only competitive against larger multilingual speech representation models (see Sections 4 and 5), but also competitive with monolingual solutions that do not share their parameters across languages.

²² Available at: https://huggingface.co/facebook/hubert-base-ls960

Setup.

We train ASR models for open datasets from the ESB benchmark [20]. We use the data selection method from Lagos and Calapodescu [27] to downsample training data to 50 h for faster training with marginal performance impact. We also adopt their experimental setup, which we now detail.
We train English ASR models using either mHuBERT-147 or the original HuBERT-base (hubert-base-ls960) as speech encoder, and follow the ASR recipe from the SpeechBrain library [42], detailed in Appendix Section A.4 (proxy evaluation). As in the proxy evaluation, we train ASR models using two distinct optimizers, one for the upstream model, and another for the three subsequent neural networks that precede the CTC output projection layer ( $lr=0.9$ ). Table 13 presents the hyper-parameters we use per dataset.

Results.

Table 14 presents WER results for both models across all datasets. The average WER for the ASR models using HuBERT-base and mHuBERT-147 is respectively 23.1 and 23.6. Unsurprisingly, we find that fine-tuning a model pre-trained solely on the target language (HuBERT-base) yields better results for most datasets. Nonetheless, we observe that mHuBERT-147, despite being trained in a multilingual setting and having to share its parameters across many languages, still performs competitively with the English model. This model is on average only 0.5 WER points worse than HuBERT-base. We believe that the competitiveness of mHuBERT-147 compared to this monolingual model highlights its potential as a compact yet powerful backbone for multilingual speech applications.

	warm-up steps	learning rate (upstream)	dropout	epochs
AMI [5, 43]	200	5e-5	0.3	30
CV [1]	100	5e-5	0	30
Earnings-22 [15]	200	5e-5	0	30
GigaSpeech [7]	200	1e-5	0	40
LibriSpeech [36]	100	1e-5	0.15	40
TED-LIUM [22]	200	5e-5	0	40
VP [55]	100	5e-5	0	35

Table 13: Hyper-parameters for the English ASR models following the setup from Lagos and Calapodescu [27].

	HuBERT-base	mHuBERT-147
AMI	33.1	33.6 (+0.5)
CV	37.2	35.4 (-1.8)
Earnings-22	35.3	34.7 (-0.6)
GigaSpeech	25.9	26.3 (+0.4)
LibriSpeech-clean	7.7	9.7 (+2.0)
LibriSpeech-other	13.6	17.3 (+3.7)
TED-LIUM	11.7	13.1 (+1.4)
VP	18.7	19.0 (+0.3)
Average WER ( $\downarrow$ )	23.1	23.6 (+0.5)

Table 14: WER ( $\downarrow$ ) scores for English ASR systems trained on the different datasets from the ESB Benchmark. Following Lagos and Calapodescu [27], only 50 h of training data are used. The score difference between mHuBERT-147 and HuBERT-base is presented in parentheses.

Appendix C Data

C.1 Validation Set

The validation set is the same for all our experiments. It was created by sampling 5 utterances per (language, source) pair from the training set, resulting in 1,275 examples. These examples are not removed from the training set.
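A minimal sketch of this sampling procedure is given below. Since 5 utterances per pair yield 1,275 examples, the collection contains 255 (language, source) pairs. The manifest layout (tuples of utterance id, language, source), the function name `sample_validation`, and the fixed seed are assumptions made for illustration; they are not the released pre-processing code.

```python
import random
from collections import defaultdict


def sample_validation(manifest, per_pair=5, seed=0):
    """Sample `per_pair` utterances for every (language, source) pair.

    manifest: iterable of (utterance_id, language, source) tuples (assumed layout).
    The sampled utterances are only listed; they are not removed from training.
    """
    rng = random.Random(seed)
    by_pair = defaultdict(list)
    for utt_id, language, source in manifest:
        by_pair[(language, source)].append(utt_id)

    selected = []
    for pair in sorted(by_pair):  # sorted for a reproducible order
        utterances = by_pair[pair]
        selected.extend(rng.sample(utterances, min(per_pair, len(utterances))))
    return selected


# Toy manifest with three (language, source) pairs of ten utterances each.
toy_manifest = [
    (f"{lang}-{src}-{i}", lang, src)
    for lang, src in [("abk", "CV"), ("abk", "VL"), ("afr", "VL")]
    for i in range(10)
]
validation = sample_validation(toy_manifest)
print(len(validation))  # 15 = 3 pairs x 5 utterances
print(1275 // 5)        # 255 pairs in the full collection
```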

C.2 VoxLingua107 Dataset Filtering

Manual inspection revealed that some language splits of this dataset contained a considerable amount of music-, noise- and silence-only files. These occurrences were more frequent in less-resourced languages. In order to increase the quality of the data we feed into our self-supervised models, and to more accurately estimate the amount of speech data per language they learn from, we filtered this dataset by performing music, noise and silence detection using the inaSpeechSegmenter tool [16].

Using this tool’s default settings, and knowing that the average utterance length for this dataset is 9 s, we classify a file as music if a music event is detected for longer than 2 s. Similarly, a file is classified as noise if there is a noise event for longer than 2 s, or if a non-energy event (i.e. silence) is detected for longer than 5 s. These settings were optimized using Cebuano, a language for which we observed a significant amount of noisy utterances. In total, we removed over 332K potentially noisy utterances (249K music, 83K noise/silence). We make both our filtering script²³ and list of retained VL files available.²⁴

²³ Scripts available at: https://github.com/utter-project/mHuBERT-147-scripts/tree/main/01_dataset_preproc/dataset_specific_preproc/voxlingua107
²⁴ Manifest files available at: https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest
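The thresholds above can be expressed as a small decision rule. The sketch below operates on a list of `(label, start, end)` segments in seconds, the kind of output a segmentation tool produces. The label strings (`"music"`, `"noise"`, `"noEnergy"`), the reading of "longer than" as the duration of a single event, the precedence of music over noise, and the function name `classify_file` are all assumptions made for illustration; the released filtering script remains the reference.

```python
MUSIC_MAX_S = 2.0    # a music event longer than this marks the file as music
NOISE_MAX_S = 2.0    # a noise event longer than this marks the file as noise
SILENCE_MAX_S = 5.0  # a no-energy event longer than this marks the file as noise


def classify_file(segments):
    """Return "music", "noise" or "keep" for one file.

    segments: list of (label, start, end) tuples, times in seconds.
    Label names are assumed, not taken from the filtering script.
    """
    # Assumption: music is checked first when several rules apply.
    for label, start, end in segments:
        if label == "music" and end - start > MUSIC_MAX_S:
            return "music"
    for label, start, end in segments:
        duration = end - start
        if label == "noise" and duration > NOISE_MAX_S:
            return "noise"
        if label == "noEnergy" and duration > SILENCE_MAX_S:
            return "noise"
    return "keep"


print(classify_file([("speech", 0.0, 6.0), ("music", 6.0, 9.0)]))     # music
print(classify_file([("noEnergy", 0.0, 4.0), ("speech", 4.0, 9.0)]))  # keep
print(classify_file([("noEnergy", 0.0, 5.5), ("speech", 5.5, 9.0)]))  # noise
```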

Table 15 presents the 107 languages of this dataset sorted into buckets corresponding to the percentage of noisy input detected by the automatic tool. We notice that for most languages, this percentage is limited: 77 languages have noise below 15%. However, for some low-resource languages, the amount of identified noisy utterances can go up to 80%. We take these automatic filtering results cautiously – as they might not be truly representative of the amount of noise present in all of these languages. However, we do believe they illustrate an existing quality issue of this popular corpus for some languages.

% of Noise	Count	Languages
[0,5]	7	abk, hye, kat, lit, ltz, sqi, tuk
(5,10]	48	afr, amh, ara, asm, aze, bak, bel, bod, bos, bul, cat, ces, deu, ell, epo, est, fas, fra, glg, heb, hrv, hun, ita, kaz, kor, mkd, mlt, mon, nep, nld, nor, pol, por, pus, ron, rus, sin, slk, slv, som, srp, swe, tat, tgk, tur, ukr, uzb, zh-CN
(10,15]	22	dan, eng, fil, fin, hat, hau, hin, ind, isl, jpn, kan, lav, lin, mal, mar, mlg, msa, oci, spa, swa, tam, vie
(15,20]	10	ben, cym, eus, fao, glv, guj, lao, tel, tha, urd
(20,30]	6	bre, ina, mya, nno, pan, yid
(30,40]	3	sna, sun, yor
(40,60]	6	ceb, jav, khm, mri, sco, snd
(50,80]	5	grn, haw, lat, san, war

Table 15: The overall percentage of noise (music, silence, noise) detected in the different language splits of VL using an automatic tool.

Lastly, we note that another key issue in this corpus is the miscategorization of language in many speech utterances. In particular, we observed many English utterances that were labelled as belonging to other languages, with this issue being far more prevalent for low-resource languages. Since resolving this would require a powerful language identification tool to work well in all of the 107 languages present in this corpus, we decided to not filter or remove these utterances – but we do highlight that the number of utterances in low-resourced languages for this corpus is probably an over-estimation.

Dataset	Full Name	Download URL
Aishell	Aishell [4]	https://www.openslr.org/33/
Aishell-3	AISHELL-3 [49]	https://www.openslr.org/93/
B-TTS	BibleTTS [25]	http://www.openslr.org/129/
Clovacall	ClovaCall [21]	https://github.com/clovaai/ClovaCall
CV	Common Voice version 11.0 [1]	https://commonvoice.mozilla.org/en/datasets
G-TTS	High quality TTS data for Javanese [50]	http://www.openslr.org/41/
	High quality TTS data for Khmer [50]	http://www.openslr.org/42/
	High quality TTS data for Nepali [50]	http://www.openslr.org/43/
	High quality TTS data for Sundanese [50]	http://www.openslr.org/44/
	High quality TTS data for four South African languages [54]	http://www.openslr.org/32/
	High quality TTS data for Bengali languages [50]	https://www.openslr.org/37/
IISc-MILE	IISc-MILE Tamil ASR Corpus [32, 33]	https://www.openslr.org/127/
IISc-MILE	IISc-MILE Kannada ASR Corpus [32, 33]	http://www.openslr.org/126/
JVS	Japanese versatile speech [52]	https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus
Kokoro	Kokoro	https://github.com/kaiidams/Kokoro-Speech-Dataset
kosp2e	Korean Speech to English Translation Corpus [11]	https://github.com/warnikchow/kosp2e
MLS	Multilingual LibriSpeech [39]	http://www.openslr.org/94/
MS	MediaSpeech [26]	https://www.openslr.org/108/
Samrómur	Samrómur Unverified 22.07 [51]	https://www.openslr.org/128/
THCHS-30	THCHS-30 [56]	https://www.openslr.org/18/
THUYG-20	THUYG-20 [45]	https://www.openslr.org/22/
VL	VoxLingua107 [53]	https://bark.phon.ioc.ee/voxlingua107/
VP	VoxPopuli [55]	https://github.com/facebookresearch/voxpopuli/

Table 16: Downloading URLs for mHuBERT-147 data. All data was downloaded between November 2022 and February 2023.

IETF code	Language	Dataset	# Hours
abk	Abkhazian	CV	34.0
abk	Abkhazian	VL	10.0
afr	Afrikaans	G-TTS	3.3
afr	Afrikaans	VL	100.0
amh	Amharic	VL	74.0
ara	Arabic	CV	61.0
ara	Arabic	VL	53.6
asm	Assamese	CV	1.0
asm	Assamese	VL	143.0
ast	Asturian	CV	0.01
aze	Azerbaijani	CV	0.05
aze	Azerbaijani	VL	55.6
bak	Bashkir	CV	213.4
bak	Bashkir	VL	54.7
bas	Basaa	CV	0.9
bel	Belarusian	CV	1,101.6
bel	Belarusian	VL	126.4
ben	Bengali	CV	33.2
		G-TTS	2.0
		VL	45.9
bod	Tibetan	VL	92.4
bos	Bosnian	VL	97.7
bre	Breton	CV	4.4
bre	Breton	VL	33.1
bul	Bulgarian	CV	4.6
		VL	47.7
		VP	773.3
cat	Catalan	CV	1,638.7
cat	Catalan	VL	80.3
ceb	Cebuano	VL	3.9
ces	Czech	CV	38.2
		VL	62.4
		VP	866.1
chv	Chuvash	CV	17.3
ckb	Central Kurdish	CV	90.5
cnh	Hakha Chin	CV	0.7
cym	Welsh	CV	100.6
cym	Welsh	VL	65.6
dan	Danish	CV	3.5
		VL	25.0
		VP	728.4
deu	German	CV	1,091.0
		MLS	1,966.5
		VL	36.3
		VP	274.8
div	Maldivian	CV	32.4
ell	Greek	CV	13.0
		VL	60.5
		VP	573.2
eng	English	CV	2,213.9
		MLS	44,659.7
		VL	43.5
		VP	293.6
epo	Esperanto	CV	1,355.6
epo	Esperanto	VL	8.9
est	Estonian	CV	29.9
		VL	34.7
		VP	682.7

Table 17: (1/5) List of included languages, with corresponding amount of speech data per dataset.

IETF code	Language	Dataset	# Hours
eus	Basque	CV	79.0
eus	Basque	VL	25.5
ewe	Ewe	B-TTS	76.6
fao	Faroese	VL	56.4
fas	Persian	CV	312.4
fas	Persian	VL	52.0
fil	Tagalog	VL	82.7
fin	Finnish	CV	5.0
		VL	29.5
		VP	816.1
fra	French	CV	836.0
		MLS	1,076.6
		VL	63.2
		VP	290.8
fry	Frisian	CV	41.2
gle	Irish	CV	3.2
glg	Galician	CV	4.4
glg	Galician	VL	66.0
glv	Manx Gaelic	VL	3.5
grn	Guarani	CV	0.4
grn	Guarani	VL	1.0
guj	Gujarati	VL	39.5
hat	Haitian Creole	VL	86.3
hau	Hausa	B-TTS	85.6
		CV	2.3
		VL	81.1
haw	Hawaiian	VL	5.4
heb	Hebrew	VL	89.7
hin	Hindi	CV	5.0
hin	Hindi	VL	73.2
hrv	Croatian	VL	109.4
hrv	Croatian	VP	836.6
hsb	Upper Sorbian	CV	1.5
hun	Hungarian	CV	9.5
		VL	68.8
		VP	851.0
hye	Armenian	CV	1.0
hye	Armenian	VL	66.5
ibo	Igbo	CV	0.01
ina	Interlingua	CV	8.3
ina	Interlingua	VL	2.1
ind	Indonesian	CV	20.6
ind	Indonesian	VL	34.4
isl	Icelandic	Samrómur	2,088.3
isl	Icelandic	VL	81.2
ita	Italian	CV	271.9
		MLS	247.4
		VL	46.6
		VP	613.8
jav	Javanese	G-TTS	3.5
jav	Javanese	VL	24.4
jpn	Japanese	CV	37.0
		JVS	26.4
		Kokoro	60.0
		VL	50.1
kab	Kabyle	CV	516.6

Table 18: (2/5) List of included languages, with corresponding amount of speech data per dataset.

IETF code	Language	Dataset	# Hours
kan	Kannada	IISc-MILE	273.8
kan	Kannada	VL	39.9
kat	Georgian	CV	6.6
kat	Georgian	VL	93.4
kaz	Kazakh	CV	0.6
kaz	Kazakh	VL	72.6
khm	Khmer	G-TTS	4.0
khm	Khmer	VL	27.0
kin	Kinyarwanda	CV	1,982.7
kir	Kyrgyz	CV	32.6
kmr	Northern Kurdish	CV	45.8
kor	Korean	Clovacall	38.1
		kosp2e	190.9
		VL	71.5
lao	Lao	VL	36.1
lat	Latin	VL	30.9
lav	Latvian	CV	3.0
		VL	37.0
		VP	868.4
lin	Lingala	B-TTS	54.0
lin	Lingala	VL	77.8
lit	Lithuanian	CV	7.1
		VL	78.4
		VP	796.2
ltz	Luxembourgish	VL	71.8
lug	Ganda	CV	363.0
mal	Malayalam	CV	0.5
mal	Malayalam	VL	42.5
mar	Marathi	CV	11.9
mar	Marathi	VL	74.7
mdf	Moksha	CV	0.2
mhr	Meadow Mari	CV	97.5
mkd	Macedonian	CV	0.2
mkd	Macedonian	VL	103.9
mlg	Malagasy	VL	94.5
mlt	Maltese	CV	3.9
		VL	62.9
		VP	818.1
mon	Mongolian	CV	6.6
mon	Mongolian	VL	65.9
mri	Māori	VL	20.5
mrj	Western Mari	CV	6.5
msa	Malay	VL	71.8
mya	Burmese	VL	31.5
myv	Erzya	CV	1.1
nan-tw	Taiwanese Hokkien	CV	0.7
nep	Nepali	CV	0.01
		G-TTS	2.8
		VL	65.6
nld	Dutch	CV	72.1
		MLS	1,554.2
		VL	37.2
		VP	277.0
nno	Norwegian Nynorsk	CV	0.4
nno	Norwegian Nynorsk	VL	43.3
nor	Norwegian	VL	98.2
oci	Occitan	VL	13.7

Table 19: (3/5) List of included languages, with corresponding amount of speech data per dataset.

IETF code	Language	Dataset	# Hours
ori	Odia	CV	0.8
pan	Punjabi	CV	1.0
pan	Punjabi	VL	42.1
pol	Polish	CV	129.4
		MLS	103.6
		VL	76.2
		VP	841.9
por	Portuguese	CV	102.3
		MLS	161.0
		VL	58.6
		VP	851.6
pus	Pashto	VL	44.7
rm-sursilv	Sursilvan	CV	2.4
rm-vallader	Vallader	CV	1.2
ron	Romanian	CV	8.5
		VL	59.0
		VP	834.8
rus	Russian	CV	149.1
rus	Russian	VL	67.5
sah	Sakha	CV	2.7
san	Sanskrit	VL	4.6
sat	Santali	CV	0.4
sco	Scots	VL	1.5
sin	Sinhala	VL	60.6
skr	Saraiki	CV	1.3
slk	Slovak	CV	12.9
		VL	36.6
		VP	644.5
slv	Slovenian	CV	7.6
		VL	112.3
		VP	832.1
sna	Shona	VL	21.5
snd	Sindhi	VL	48.2
som	Somali	VL	94.9
sot	Southern Sotho	G-TTS	3.2
spa	Spanish	CV	380.7
		MLS	917.7
		VL	33.9
		VP	301.5
sqi	Albanian	VL	67.2
srd	Sardinian	CV	0.7
srp	Serbian	CV	0.8
srp	Serbian	VL	47.5
sun	Sundanese	G-TTS	2.1
sun	Sundanese	VL	43.6
swa	Swahili	CV	304.2
swa	Swahili	VL	57.6
swe	Swedish	CV	29.5
		VL	31.3
		VP	827.4
tam	Tamil	CV	186.1
		IISc-MILE	132.2
		VL	44.1
tat	Tatar	CV	20.3
tat	Tatar	VL	93.4
tel	Telugu	VL	64.3
tgk	Tajik	VL	60.5

Table 20: (4/5) List of included languages, with corresponding amount of speech data per dataset.

IETF code	Language	Dataset	# Hours
tha	Thai	CV	134.7
tha	Thai	VL	50.8
tig	Tigre	CV	0.01
tpi	Tok Pisin	CV	3.3
tsn	Tswana	G-TTS	3.5
tuk	Turkmen	VL	81.8
tur	Turkish	CV	59.3
		MS	10.0
		VL	54.5
tw-akuapem	Akwapem Twi	B-TTS	59.8
tw-asante	Asante Twi	B-TTS	56.8
uig	Uyghur	CV	94.9
uig	Uyghur	THUYG-20	20.7
ukr	Ukrainian	CV	52.5
ukr	Ukrainian	VL	49.7
urd	Urdu	CV	38.8
urd	Urdu	VL	35.2
uzb	Uzbek	CV	69.7
uzb	Uzbek	VL	42.2
vie	Vietnamese	CV	3.8
vie	Vietnamese	VL	55.5
vot	Votic	CV	0.1
war	Waray	VL	3.9
xho	Xhosa	G-TTS	3.1
yid	Yiddish	VL	35.9
yor	Yoruba	B-TTS	24.8
yor	Yoruba	VL	66.0
yue	Cantonese	CV	15.6
zh-CN	Chinese (PRC)	Aishell	151.1
		Aishell-3	60.6
		CV	104.6
		THCHS-30	25.5
		VL	40.9
zh-HK	Chinese (Hong Kong)	CV	89.5
zh-TW	Chinese (Taiwan)	CV	56.3
Total			90,429.5

Table 21: (5/5) List of included languages, with corresponding amount of speech data per dataset.

		Monolingual ASR	Multilingual ASR		LID	Multilingual ASR + LID
		normal	normal	few-shot	normal	normal	normal	few-shot
SSL	# Params	CER ( $\downarrow$ )	CER ( $\downarrow$ )	CER ( $\downarrow$ )	ACC ( $\uparrow$ )	ACC ( $\uparrow$ )	CER ( $\downarrow$ )	CER ( $\downarrow$ )	SUPERB_s ( $\uparrow$ )
MMS-1B	965M	33.3 / 25.7	21.3 / 18.1	30.2 / 30.8	84.8 / 86.1	73.3 / 74.8	26.0 / 25.5	25.4 / 24.8	983.5 (1st) / 948.1 (2nd)
NWHC1	317M	39.5 / 30.5	28.9 / 21.5	41.4 / 38.6	67.1 / 87.4	77.1 / 90.6	28.8 / 21.5	40.3 / 38.2	774.4 / 876.9 (3rd)
NWHC2	317M	39.5 / 30.5	29.3 / 21.6	42.0 / 39.3	64.4 / 88.1	77.4 / 90.6	28.4 / 21.8	41.5 / 38.8	759.9 / 873.3
mHuBERT-147-3rd	95M	34.2 / 26.3	23.6 / 22.0	33.2 / 32.9	85.3 / 91.0	81.4 / 90.0	26.2 / 22.1	34.9 / 33.5	949.8 (2nd) / 950.2 (1st)
mHuBERT-147-2nd	95M	35.9 / 27.6	25.4 / 22.5	34.2 / 33.8	74.8 / 90.1	81.0 / 89.0	26.3 / 23.6	33.9 / 34.4	895.0 / 925.7
MMS-300M	317M	33.8 / 30.5	28.7 / 24.0	36.5 / 36.5	62.3 / 84.3	71.9 / 74.3	31.5 / 30.0	30.9 / 29.2	824.9 (3rd) / 844.3
XLS-R-300M	317M	39.7 / 30.6	29.2 / 22.0	40.9 / 39.3	66.9 / 87.9	55.6 / 85.6	28.4 / 22.9	42.1 / 42.4	730.8 / 850.5
WavLabLM-large-MS	317M	40.5 / 32.8	37.8 / 31.9	43.8 / 42.8	71.7 / 81.1	70.8 / 80.0	37.0 / 32.2	43.4 / 41.2	707.5 / 740.9