MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research

Abstract

Recently, multilingual artificial intelligence assistants, exemplified by ChatGPT, have gained immense popularity. As a crucial gateway to human-computer interaction, multilingual automatic speech recognition (ASR) has also garnered significant attention, as evidenced by systems like Whisper. However, the proprietary nature of the training data has impeded researchers’ efforts to study multilingual ASR. This paper introduces MSR-86K, an evolving, large-scale multilingual corpus for speech recognition research. The corpus is derived from publicly accessible videos on YouTube, comprising 15 languages and a total of 86,300 hours of transcribed ASR data. We also introduce how to use the MSR-86K corpus and other open-source corpora to train a robust multilingual ASR model that is competitive with Whisper. MSR-86K will be publicly released on HuggingFace (https://huggingface.co/datasets/Alex-Song/MSR-86K), and we believe that such a large corpus will pave new avenues for research in multilingual ASR.

Index Terms: speech recognition, multilingual, corpus

1 Introduction

Thanks to the rapid development of deep learning, research in speech recognition has gradually shifted from hybrid systems based on Hidden Markov Models to end-to-end ASR systems entirely built on neural networks[1, 2, 3, 4, 5, 6]. In fact, the swift progress of end-to-end ASR has also benefited from the contribution of open-source corpora, such as the commonly used LibriSpeech[7] and GigaSpeech[8] for English, as well as AISHELL[9] and WenetSpeech[10] for Chinese. These open-source corpora have facilitated research in the field of speech recognition by both academia and industry. In the multilingual domain, the Common Voice[11] project, alongside the multilingual LibriSpeech (MLS) [12] corpus released by Meta, has greatly promoted research in multilingual ASR. In recent times, the success of OpenAI’s Whisper[5] model has demonstrated that big data combined with large models can yield improved performance. However, Whisper has not made its training data public, hindering researchers’ ability to replicate the results. The MSR-86K corpus introduced in this paper aims to bridge this gap, further advancing research in multilingual ASR.

Table 1: Comparison of MSR-86K with common public multilingual ASR corpora.
Corpus | # Languages | Total Hours (Transcribed) | Domains | Speech Type
BABEL[13] | 17 | 1k | Conversational | Spontaneous
Common Voice[11] | 112 | 18k | Open domain | Read
MLS[12] | 8 | 50.5k | Audiobook | Read
FLEURS | 102 | 1.4k | Wikipedia | Read
CMU Wilderness[14] | 700 | 14k | Religion | Read
CoVoST-2[15] | 22 | 2.9k | Open domain | Read
Europarl-ST[16] | 6 | 500 | Parliament | Spontaneous
MuST-C[17] | 9 | 385 | TED talks | Spontaneous
mTEDx[18] | 9 | 1k | TED talks | Spontaneous
VoxPopuli[19] | 16 | 1791 | Parliament | Spontaneous
CVSS[20] | 22 | 1.1k | Open domain | Read/Synthetic
MSR-86K (ours) | 15 | 86.3k | YouTube | Spontaneous

Existing multilingual ASR corpora have two main shortcomings. First, most corpora are dominated by English and Western European languages and lack sufficient linguistic diversity. Second, although some corpora cover a broad range of languages, the duration of recordings for each language is often minimal and insufficient for building a usable ASR system. The MSR-86K corpus addresses these issues by ensuring substantial coverage of languages and providing enough data per language to independently train a robust ASR system. We constructed a series of protocols to automatically retrieve publicly accessible videos from YouTube and set up a data processing pipeline to automatically generate the MSR-86K corpus, significantly reducing the costs associated with data collection and labeling. Table 1 illustrates the distinctions between our MSR-86K and other public multilingual ASR corpora.

Whisper is an excellent multilingual model, but its best-performing variant has a large number of parameters, which results in slower inference speed and greater memory overhead. In this paper, we introduce how to use easily accessible unsupervised data for pre-training and fine-tuning with MSR-86K and other open-source corpora to build a robust multilingual ASR model that is faster, smaller in size, and has performance that matches or even exceeds that of the Whisper large model.

The rest of the paper is organized as follows. In Section 2, the process of constructing the MSR-86K corpus is described. In Section 3, we introduce our experiments and discussions. Finally, the paper is concluded in Section 4.

2 Corpus Construction

This section describes the major steps involved in creating the MSR-86K corpus, and Figure 1 illustrates this process.

Figure 1: The construction process of the MSR-86K corpus.

2.1 Data collection

Creating keyword lists. First, we start by generating a preliminary list of keywords through querying Wikipedia articles in the target language. Recognizing the presence of numerous non-target language terms within these entries, we then implement a keyword filtering module to refine our list. The module selectively filters and retains terms that are likely to be significant keywords, ensuring relevance in the target language.

Retrieving video IDs. Next, we query the YouTube search engine with each keyword in the list to obtain a list of video IDs. Since different keywords may lead to the same videos, we deduplicate the video ID list. Because we intend to share the dataset, we retain only videos that are available for public download and remove private, paid, and restricted videos.
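The deduplication step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `dedupe_video_ids`, the input layout (a mapping from keyword to returned IDs), and the sample IDs are all hypothetical.

```python
from typing import Dict, Iterable, List


def dedupe_video_ids(results_per_keyword: Dict[str, Iterable[str]]) -> List[str]:
    """Merge per-keyword search results into one list of unique video IDs.

    First-seen order is preserved so the output is deterministic.
    """
    seen = set()
    unique_ids = []
    for video_ids in results_per_keyword.values():
        for video_id in video_ids:
            if video_id not in seen:
                seen.add(video_id)
                unique_ids.append(video_id)
    return unique_ids


# Hypothetical search results: two keywords return an overlapping video.
search_results = {
    "historia": ["vid_a", "vid_b"],
    "ciencia": ["vid_b", "vid_c"],
}
unique = dedupe_video_ids(search_results)
```

Keeping a `seen` set makes each membership check constant time, which matters when many keywords return heavily overlapping results.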

Detecting video subtitles. To guarantee the quality of the corpus annotations as far as possible, we run a subtitle detection process on the videos and select those that have manually uploaded subtitles. The remaining videos, which lack subtitles, serve as unsupervised data sources; only their audio is used.

Downloading audio and subtitles. We download the audio tracks of videos and their corresponding manually uploaded subtitles with the YouTube download engine yt-dlp (https://github.com/yt-dlp/yt-dlp) as the primary data source for MSR-86K. Additionally, we download the audio from some videos without subtitles to serve as the data source for unsupervised pre-training. Each audio file is converted into a single-channel wav format, sampled at a 16 kHz rate.
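One common way to perform the conversion to single-channel 16 kHz wav is ffmpeg. The sketch below only builds the command line. The paper does not state which tool or bit depth was used, so the choice of ffmpeg, the 16-bit PCM codec, the helper name `build_ffmpeg_command`, and the file paths are assumptions for illustration.

```python
from pathlib import Path
from typing import List


def build_ffmpeg_command(src: Path, dst: Path, sample_rate: int = 16000) -> List[str]:
    """Return an ffmpeg argv that writes `src` as mono PCM WAV at `sample_rate` Hz."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(src),            # input file
        "-vn",                     # ignore any video stream
        "-ac", "1",                # one audio channel (mono)
        "-ar", str(sample_rate),   # resample to the target rate
        "-c:a", "pcm_s16le",       # 16-bit PCM (assumed; not stated in the paper)
        str(dst),
    ]


# Hypothetical paths, for illustration only.
cmd = build_ffmpeg_command(Path("downloads/video123.m4a"), Path("wav/video123.wav"))
# To run the conversion: subprocess.run(cmd, check=True)
```

Building the argument list separately from executing it keeps the conversion settings easy to inspect and to reuse across a batch of files.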

2.2 ASR corpus construction

Text normalization. Video subtitles contain several non-semantic symbols. To streamline further processing, we need to normalize the text. This involves transforming the case, removing punctuation and emojis, converting numbers, and eliminating special symbols associated with specific languages.
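A minimal sketch of such a normalizer is shown below. It covers case folding and the removal of punctuation, emojis, and other symbols using Unicode character categories. Number conversion and language-specific symbol handling are omitted because they depend on per-language rules that the paper does not detail. The function name `normalize_text` and the choice to keep apostrophes are assumptions.

```python
import unicodedata


def normalize_text(text: str, keep: str = "'") -> str:
    """Lowercase, drop punctuation and symbols (including emojis), collapse spaces.

    Characters listed in `keep` are preserved even if they are punctuation.
    """
    text = unicodedata.normalize("NFC", text).lower()
    cleaned = []
    for ch in text:
        if ch in keep:
            cleaned.append(ch)
            continue
        # P* = punctuation, S* = symbols (most emojis are "So"), C* = control/format.
        if unicodedata.category(ch)[0] in {"P", "S", "C"}:
            cleaned.append(" ")
        else:
            cleaned.append(ch)
    # Splitting on whitespace and re-joining collapses repeated spaces.
    return " ".join("".join(cleaned).split())


example = normalize_text("Hello, World! 😀 [Music]")
```

Filtering by Unicode category rather than by an ASCII punctuation list keeps combining marks (category M), which scripts such as Hindi and Thai need, while still removing punctuation from any script.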

Forced alignment. Although video subtitles come with timestamps, many of them are inaccurate, so the audio must be re-aligned with the subtitles. Building on prior work, we use a pre-trained ASR model based on the connectionist temporal classification (CTC) [21] criterion for alignment, and take the median of the alignment scores as the cutoff for filtering.
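The median-based cutoff can be sketched as below. The paper does not say whether the median is computed per video or over the whole corpus, nor whether segments exactly at the median are kept. This sketch assumes a single pool of segments, treats higher scores as better, and keeps ties. The names `Segment` and `filter_by_median` and the sample scores are hypothetical.

```python
from statistics import median
from typing import List, Tuple

Segment = Tuple[str, float]  # (segment_id, alignment_score); higher is better


def filter_by_median(segments: List[Segment]) -> List[Segment]:
    """Keep segments whose alignment score is at or above the median score."""
    if not segments:
        return []
    cutoff = median(score for _, score in segments)
    return [seg for seg in segments if seg[1] >= cutoff]


# Hypothetical alignment scores (log-probability style, so values are negative).
scored = [("s1", -0.2), ("s2", -1.5), ("s3", -3.0), ("s4", -0.8), ("s5", -5.1)]
kept = filter_by_median(scored)
kept_ids = [seg_id for seg_id, _ in kept]
```

A median cutoff discards roughly half of the aligned segments regardless of the score scale, which avoids tuning an absolute threshold for each language.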

Figure 2: The description of duration balance.

Duration balance. Due to memory constraints, subtitles are usually segmented for forced alignment. However, each segment does not necessarily correspond to the exact endpoint of the speaker’s utterance, resulting in a relatively short distribution of audio duration. To balance the audio duration and ensure the integrity of the speech content as much as possible, we conducted voice activity detection (VAD) based on the output of the CTC model, and limited the maximum duration to 20 seconds. Figure 2 shows the duration statistics before and after VAD.
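The effect of duration balancing can be illustrated with a greedy merge of adjacent aligned segments. This is a simplified stand-in: the authors derive speech boundaries from VAD on the CTC output, whereas this sketch uses the pause between consecutive segments as a proxy. The 20-second cap comes from the text; the `max_gap` threshold, the `Segment` class, and `merge_segments` are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_segments(segments: List[Segment], max_duration: float = 20.0,
                   max_gap: float = 0.3) -> List[Segment]:
    """Greedily join consecutive segments separated by a short pause.

    A merge happens only if the combined clip stays within `max_duration`.
    """
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged:
            last = merged[-1]
            gap = seg.start - last.end
            if gap <= max_gap and seg.end - last.start <= max_duration:
                merged[-1] = Segment(last.start, seg.end, f"{last.text} {seg.text}")
                continue
        merged.append(seg)
    return merged


# Hypothetical aligned segments.
clips = merge_segments([
    Segment(0.0, 2.0, "a"),
    Segment(2.1, 4.0, "b"),    # short pause after "a": merged
    Segment(6.0, 8.0, "c"),    # long pause: starts a new clip
    Segment(8.1, 27.0, "d"),   # merging with "c" would exceed 20 s: kept separate
])
texts = [c.text for c in clips]
```

Merging across short pauses lengthens clips that were cut mid-utterance, while the duration cap keeps every merged clip within the stated limit.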

LID filter. After reviewing the outcomes of forced alignment, we noticed that there were still some inaccuracies. The most common issues included mismatched languages between the audio and the subtitles, subtitles that were categorized as descriptive captions, and audio that was either purely music or completely silent. Consequently, we develop a language identification (LID) model that effectively filters out sentences where discrepancies exist between the audio and the subtitles, significantly improving the quality of the data.
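In its simplest form, the LID filter is a comparison between the language predicted from the audio and the language of the subtitles. The sketch below assumes each segment already carries a predicted language and a confidence from some LID model. The field names `lid_lang` and `lid_conf`, the confidence threshold, and the sample data are hypothetical, and the paper's LID model is not reproduced here.

```python
from typing import Dict, List


def lid_filter(segments: List[Dict], target_lang: str,
               min_conf: float = 0.8) -> List[Dict]:
    """Keep segments whose predicted language matches the subtitle language."""
    return [
        seg for seg in segments
        if seg["lid_lang"] == target_lang and seg["lid_conf"] >= min_conf
    ]


# Hypothetical LID outputs for a Spanish subtitle track.
candidates = [
    {"id": "s1", "lid_lang": "es", "lid_conf": 0.97},
    {"id": "s2", "lid_lang": "en", "lid_conf": 0.91},  # language mismatch
    {"id": "s3", "lid_lang": "es", "lid_conf": 0.42},  # e.g. music or silence
]
clean_ids = [seg["id"] for seg in lid_filter(candidates, target_lang="es")]
```

A confidence threshold also removes music-only or silent clips when the LID model assigns them a low-confidence label.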

ASR filter. To further improve data quality, we train an ASR model using both existing open-source data and the data filtered by the LID model. This ASR model is used to decode the data processed by the LID filter and calculate the word error rate (WER). By filtering out segments with higher WER, we ensure greater accuracy in our dataset annotations.
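The WER used for this filter is the word-level edit distance between the subtitle and the ASR output, divided by the number of subtitle words. A minimal sketch follows. The threshold `max_wer=0.2` and the function names are assumptions, as the paper does not report the cutoff it used.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of `hypothesis` against `reference`."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def keep_segment(subtitle: str, asr_output: str, max_wer: float = 0.2) -> bool:
    """Keep a segment when the ASR output agrees closely with its subtitle."""
    return wer(subtitle, asr_output) <= max_wer


score = wer("the cat sat on the mat", "the cat sat on mat")
```

For languages scored by CER, the same `edit_distance` function can be applied to lists of characters instead of lists of words.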

Table 2: Duration of each language in the MSR-86K corpus and the results of the monolingual ASR models on the dev set.
Language | Train duration (hrs) | Dev duration (hrs) | Monolingual WER/CER (%)
Spanish | 13976.84 | 18.63 | 6.36
Korean | 10338.66 | 18.56 | 4.79
English | 9795.46 | 17.42 | 5.61
French | 8316.70 | 15.84 | 8.43
German | 6862.00 | 14.38 | 6.39
Hindi | 5986.50 | 11.62 | 9.90
Vietnamese | 5957.47 | 11.54 | 3.51
Italian | 5691.28 | 11.40 | 5.18
Dutch | 4138.50 | 9.67 | 5.99
Portuguese | 3737.54 | 9.62 | 7.43
Thai | 3674.70 | 9.47 | 4.17
Russian | 3188.52 | 9.35 | 8.90
Indonesian | 1982.87 | 9.12 | 6.75
Japanese | 1779.03 | 8.54 | 3.44
Arabic | 873.84 | 4.95 | 9.40
Total/Avg | 86299.91 | 180.11 | 6.42

Data split. Based on the forced alignment scores, LID scores, and WER, we select a portion of the data with the highest quality to serve as the development set, while the remaining data is allocated for the training set. The distribution of durations across different languages is detailed in Table 2. For the test set, we use the test portion of the Common Voice corpus, which has undergone stringent manual verification to ensure the high quality required for multilingual ASR testing.

2.3 Unsupervised corpus construction

For audio without subtitles, we employ a sound event detection model to filter out music and noise, and segment the audio at points of silence into clips shorter than 30 seconds. Ultimately, we obtain a total of 200k hours of unsupervised data.

3 Experiments and Discussions

In this section, we first evaluate the MSR-86K corpus to assess its overall quality. We then describe how to use unsupervised data for pre-training and how to fine-tune with MSR-86K and other open-source data to obtain a non-autoregressive multilingual speech recognition model that outperforms the Whisper large model.

3.1 Data evaluation

To evaluate the quality of the MSR-86K corpus, we trained a monolingual model using the training set for each language. Then, we performed beam search decoding on the MSR-86K development set and calculated the word error rate (WER) or character error rate (CER). Our evaluation model utilizes the Transformer-CTC architecture, in which $d^{model}=768$, $d^{ffn}=3072$, $d^{head}=12$, and $num\_layers=12$. In addition, a convolutional front-end was used to sub-sample the acoustic features by a factor of 6. Moreover, each language is equipped with its own respective vocabulary, which employs a byte-level byte-pair encoding (BPE) model with a vocabulary size of 2000.

As shown in Table 2, the monolingual ASR models for 15 languages all achieved a WER or CER below 10% on their respective development sets, with some languages reaching below 5%, and an average error rate of 6.42% across all languages. Considering that our evaluation model does not employ state-of-the-art ASR models and given the spontaneous nature of YouTube audio, an overall error rate of 6.42% meets our expectations, indicating that the data quality has reached a relatively ideal level. Therefore, in practice, the development set of MSR-86K can serve as a multilingual test set of the YouTube domain for other open-source corpora.

Figure 3: The workflow of our multilingual ASR model training.

3.2 Multilingual ASR construction

Whisper is an excellent multilingual model that performs well across mainstream languages around the world. However, Whisper has not made its training data public, making it difficult for researchers to replicate its results. Our MSR-86K corpus effectively bridges this gap and can facilitate researchers’ studies on large-scale multilingual speech recognition. Additionally, the best-performing model of Whisper has a high parameter count of up to 1.55 billion, which results in slower inference speed and also requires more memory and computational resources. In this section, we explain how to leverage easily accessible unsupervised data for pre-training, and then fine-tune with the MSR-86K and other existing multilingual open-source corpora to develop a multilingual ASR model that has a smaller parameter size, faster speed, and performance that is comparable to or even surpasses that of Whisper. The workflow of our multilingual ASR model training is illustrated in Figure 3.

Data preparation. Whisper (v2) was trained using 680k hours of annotated data, while Whisper large-v3 has reached a scale of 5 million hours, which is daunting for the average researcher. As illustrated in Table 3, we employed our contributed MSR-86K and various other open-source multilingual corpora for our transcribed data. In addition, to reduce the model’s dependency on transcribed data, we explored unsupervised pre-training methods. By combining the data listed in Table 3 with the unsupervised data detailed in Section 2.3, we amassed a comprehensive corpus of approximately 400k hours.

Table 3: Summary of multilingual open-source corpora for our experiments.
Language | Corpus | Total Hours
English | Librispeech[7], mTEDx[18], GigaSpeech[8], MLS[12], The People’s Speech[22], CommonVoice[11], MSR-86K | 78180.96
Chinese | AISHELL-1[9], AISHELL-2[23], Thchs30[24], Aidatatang_200zh[25], Primewords[26], TAL_ASR[27], TAL_CSASR[28], WenetSpeech[10], CommonVoice | 12227.58
Spanish | mTEDx, CommonVoice, MLS, MSR-86K | 15507.07
Korean | CommonVoice, Zeroth_Korean[29], MSR-86K | 10418.17
French | CommonVoice, MLS, MSR-86K | 10125.3
German | Bundestag[30], MLS, CommonVoice, MSR-86K | 10278.88
Hindi | CommonVoice, MSR-86K | 5994.73
Vietnamese | CommonVoice, MSR-86K | 5960.94
Italian | CommonVoice, MSR-86K, MLS | 6183.40
Dutch | CommonVoice, MSR-86K, MLS | 5750.03
Portuguese | CommonVoice, MSR-86K, MLS, mTEDx | 4074.15
Thai | CommonVoice, MSR-86K | 3726.11
Russian | CommonVoice, Open_STT[32], Golos[31], MSR-86K | 8149.94
Indonesian | CommonVoice, MSR-86K | 1994.47
Japanese | CommonVoice, MSR-86K, JTubeSpeech[33], Reazonspeech[34] | 21774.56
Arabic | CommonVoice, MSR-86K, MGB2[35], QASR[36] | 4162.44
Total | | 204508.73
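
Table 4: Comparison of our multilingual ASR model and the Whisper models in terms of WER/CER (%) on open-source test sets.
Language | Test set | Whisper Medium (769M), without LID | Whisper Medium (769M), with LID | Whisper Large v2 (1.55B), without LID | Whisper Large v2 (1.55B), with LID | Ours HuBERT-CTC (362M), without LID | Ours HuBERT-CTC (362M), with LID
English | Librispeech test-clean | 3.5 | 2.8 | 3.4 | 2.6 | 2.5 | 2.3
English | Librispeech test-other | 6.5 | 6.2 | 5.3 | 4.9 | 5.2 | 4.9
English | Gigaspeech | 13.6 | 12.7 | 11.8 | 10.6 | 11.8 | 10.5
English | MLS | 8.1 | 7.0 | 7.3 | 6.8 | 6.6 | 6.4
English | TEDLIUM2 | 4.6 | 4.5 | 4.5 | 4.4 | 4.3 | 4.2
English | CommonVoice | 22.4 | 13.4 | 17.6 | 10.8 | 11.2 | 10.7
Chinese | AISHELL-1 | 7.1 | 7.3 | 6.2 | 6.2 | 3.2 | 2.8
Chinese | AISHELL-2 test-mic | 6.8 | 6.1 | 5.3 | 5.3 | 4.8 | 4.3
Chinese | THCHS-30 | 8.6 | 8.6 | 6.8 | 6.8 | 6.2 | 6.0
Chinese | WenetSpeech dev | 11.7 | 11.5 | 10.4 | 10.4 | 7.1 | 6.5
Chinese | WenetSpeech test-meeting | 22.0 | 22.5 | 22.1 | 21.6 | 12.7 | 10.1
Chinese | CommonVoice | 16.1 | 16.0 | 15.7 | 14.2 | 10.2 | 9.8
Spanish | CommonVoice | 9.9 | 7.8 | 6.5 | 6.3 | 6.3 | 5.8
Korean | CommonVoice | 7.9 | 7.9 | 6.2 | 6.2 | 6.0 | 5.8
French | CommonVoice | 17.9 | 14.7 | 12.9 | 11.6 | 10.4 | 10.1
German | CommonVoice | 10.1 | 8.3 | 7.6 | 6.5 | 6.4 | 6.2
Hindi | CommonVoice | 62.0 | 46.4 | 53.0 | 37.0 | 19.6 | 17.1
Vietnamese | CommonVoice | 26.7 | 24.8 | 23.2 | 21.3 | 12.3 | 10.0
Italian | CommonVoice | 12.2 | 9.3 | 8.9 | 7.6 | 7.7 | 7.2
Dutch | CommonVoice | 10.5 | 7.8 | 7.6 | 5.8 | 5.9 | 5.6
Portuguese | CommonVoice | 11.9 | 8.8 | 8.9 | 7.2 | 6.9 | 6.5
Thai | CommonVoice | 14.1 | 12.0 | 10.9 | 10.3 | 3.2 | 3.0
Russian | CommonVoice | 12.0 | 10.6 | 8.6 | 8.1 | 8.6 | 7.5
Indonesian | CommonVoice | 13.8 | 13.0 | 11.5 | 8.7 | 8.1 | 8.0
Japanese | CommonVoice | 14.5 | 11.4 | 12.1 | 9.7 | 10.5 | 9.6
Arabic | CommonVoice | 31.7 | 30.7 | 21.8 | 21.2 | 19.8 | 15.5
Avg | | 14.9 | 12.8 | 12.2 | 10.5 | 8.4 | 7.6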

Table 5: Comparison of WER/CER (%) on the MSR-86K development set for Whisper and our system, with LID provided in advance.
Language | Whisper Medium (769M) | Whisper Large v2 (1.55B) | Ours HuBERT-CTC (362M)
English | 4.9 | 4.4 | 4.0
Spanish | 7.4 | 7.1 | 5.6
Korean | 9.6 | 7.2 | 4.5
French | 9.8 | 9.4 | 7.0
German | 6.3 | 5.6 | 5.0
Hindi | 69.0 | 47.4 | 7.5
Vietnamese | 28.8 | 19.2 | 3.1
Italian | 6.2 | 5.8 | 4.1
Dutch | 8.5 | 7.3 | 5.1
Portuguese | 8.7 | 7.8 | 6.1
Thai | 27.9 | 21.7 | 3.0
Russian | 9.7 | 9.7 | 7.3
Indonesian | 13.7 | 11.0 | 5.0
Japanese | 7.6 | 6.9 | 3.4
Arabic | 67.4 | 44.5 | 6.3
Avg | 19.0 | 14.3 | 5.1

Pre-training. We first conducted unsupervised pre-training with the prepared data. Given the superior performance of HuBERT[37], we chose it as the criterion for unsupervised pre-training. We used a Transformer encoder similar to the one described in Section 3.1 as the acoustic encoder, where $d^{model}=1024$, $d^{ffn}=4096$, $d^{head}=16$, and $num\_layers=24$.
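To give a sense of scale for the two encoder configurations (Sections 3.1 and 3.2), the sketch below estimates the weight count of the Transformer layers alone. It is a back-of-the-envelope calculation under stated assumptions: it counts only the four attention projections and the two feed-forward matrices per layer, and ignores biases, layer normalization, the convolutional front-end, the output layer, and the language model. It is therefore not expected to reproduce the 362M total reported for the full system. The function name is hypothetical.

```python
def transformer_encoder_params(d_model: int, d_ffn: int, num_layers: int) -> int:
    """Approximate weight count of a Transformer encoder stack.

    Counts only the large matrices in each layer:
      - attention: query, key, value, and output projections (4 * d_model^2)
      - feed-forward: two linear maps (2 * d_model * d_ffn)
    """
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * d_ffn
    return num_layers * (attention + feed_forward)


# Evaluation model of Section 3.1 and acoustic encoder of Section 3.2.
eval_model = transformer_encoder_params(d_model=768, d_ffn=3072, num_layers=12)
acoustic_encoder = transformer_encoder_params(d_model=1024, d_ffn=4096, num_layers=24)
```

Under these assumptions the estimate is about 85M weights for the evaluation model and about 302M for the pre-trained encoder, so the encoder stack accounts for most of the reported model size.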

Fine-tuning. Next, we fine-tuned the pre-trained HuBERT model using the dataset presented in Table 3, with CTC as the training criterion. Similar to Whisper, our vocabulary is shared across all languages. We trained a byte-level BPE model with a vocabulary size of 10,000 using the texts from the corpora presented in Table 3 to establish the lexicon for CTC, to which we added an extra token to signify the blank symbol.

LID Prompt-tuning. Multilingual ASR typically encounters two usage scenarios. The first scenario is where the language of the speech to be recognized is not known in advance, necessitating the model to identify it autonomously. In the second, the language information is provided in advance, guiding the model to bolster its performance in recognizing the specified language. To enable the CTC model to accommodate both scenarios, we employed the method proposed in[38], using language identity (LID) as a prompt to enhance the recognition performance of the target language.

NNLM Training. To further enhance the performance of the HuBERT-CTC multilingual ASR model, we trained a simple LSTM-based language model using the text from the corpora in Table 3, and employed shallow fusion for decoding.
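Shallow fusion combines the acoustic model score and the language model score log-linearly. The sketch below applies that combination to a finished n-best list, which is a simplification: in actual shallow fusion the scores are combined at each step of beam search. The weight `lm_weight=0.3`, the type alias `Hypothesis`, the function names, and the example scores are assumptions, as the paper does not report its fusion weight.

```python
from typing import List, Tuple

Hypothesis = Tuple[str, float, float]  # (text, log P_ctc, log P_lm)


def shallow_fusion_score(ctc_logp: float, lm_logp: float,
                         lm_weight: float = 0.3) -> float:
    """Log-linear combination of acoustic and language model scores."""
    return ctc_logp + lm_weight * lm_logp


def best_hypothesis(nbest: List[Hypothesis], lm_weight: float = 0.3) -> str:
    """Return the hypothesis text with the highest fused score."""
    return max(nbest, key=lambda h: shallow_fusion_score(h[1], h[2], lm_weight))[0]


# Hypothetical n-best list: the acoustically preferred hypothesis is less
# plausible according to the language model.
nbest = [
    ("i scream", -4.0, -9.0),
    ("ice cream", -4.2, -6.0),
]
with_lm = best_hypothesis(nbest, lm_weight=0.3)
without_lm = best_hypothesis(nbest, lm_weight=0.0)
```

With the language model weight set to zero the acoustic score alone decides, while a positive weight lets the language model overturn acoustically close alternatives.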

Through the four steps mentioned above, we obtained a high-performance multilingual ASR model with a total parameter size of 362M, which is substantially smaller than the Whisper large model (1.55B), making it more suitable for deployment.

3.3 Multilingual ASR evaluation

Due to the differences in the scale of training data, our primary benchmark for the multilingual ASR model is the Whisper large-v2 model. We use the pipeline provided by HuggingFace (https://huggingface.co/openai/whisper-large-v3) for inference, testing each model both with the LID provided in advance and without it. Most previous papers have not tested the performance of Whisper without LID, and we believe that the results without LID are also meaningful. The recognition results for all languages were subjected to text normalization prior to the calculation of WER or CER. It is important to note that for Chinese, traditional characters must be converted to simplified characters for the error rate to be calculated accurately.

As shown in Table 4, our multilingual ASR model matches or outperforms the Whisper medium and large-v2 models on every test set, regardless of whether the LID is provided in advance, despite being trained with less transcribed data. It is worth mentioning that Whisper’s performance declines significantly on the Common Voice English test set when the LID is not specified beforehand. This drop can be largely ascribed to erroneous LID predictions, which exacerbate the error propagation inherent in autoregressive models. In contrast, our model is robust and maintains stable performance whether or not LID information is available. The results in Table 5 show that our model also surpasses Whisper on the MSR-86K development set, which further supports the effectiveness of our approach.

4 Conclusions

In this paper, we introduce the MSR-86K corpus, an evolving, multilingual corpus with 86,300 hours of transcribed audio for speech recognition research. We believe that such a large-scale corpus will propel the research in multilingual speech algorithms. We also hope that more researchers will contribute to open-source data and work together to advance the development of the intelligent speech field. Additionally, we explain how to effectively leverage readily available unsupervised data, MSR-86K, and other open-source corpora to train a robust ASR model that is competitive with Whisper in terms of performance but smaller in size and faster, allowing everyone to use open-source data to train their own multilingual ASR model.

References

  • [1] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369–376.
  • [2] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
  • [3] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2016, pp. 4960–4964.
  • [4] Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang et al., “Google usm: Scaling automatic speech recognition beyond 100 languages,” arXiv preprint arXiv:2303.01037, 2023.
  • [5] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning.   PMLR, 2023, pp. 28492–28518.
  • [6] V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi et al., “Scaling speech technology to 1,000+ languages,” arXiv preprint arXiv:2305.13516, 2023.
  • [7] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2015, pp. 5206–5210.
  • [8] G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang et al., “Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,” arXiv preprint arXiv:2106.06909, 2021.
  • [9] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in 2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech systems and assessment (O-COCOSDA).   IEEE, 2017, pp. 1–5.
  • [10] B. Zhang, H. Lv, P. Guo, Q. Shao, C. Yang, L. Xie, X. Xu, H. Bu, X. Chen, C. Zeng et al., “Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 6182–6186.
  • [11] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” arXiv preprint arXiv:1912.06670, 2019.
  • [12] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “Mls: A large-scale multilingual dataset for speech research,” arXiv preprint arXiv:2012.03411, 2020.
  • [13] M. J. Gales, K. M. Knill, A. Ragni, and S. P. Rath, “Speech recognition and keyword spotting for low-resource languages: Babel project research at cued,” in Fourth International workshop on spoken language technologies for under-resourced languages (SLTU-2014).   ISCA, 2014, pp. 16–23.
  • [14] A. W. Black, “Cmu wilderness multilingual speech dataset,” in IEEE ICASSP.   IEEE, 2019, pp. 5971–5975.
  • [15] C. Wang, A. Wu, and J. Pino, “Covost 2 and massively multilingual speech-to-text translation,” arXiv preprint arXiv:2007.10310, 2020.
  • [16] J. Iranzo-Sánchez, J. A. Silvestre-Cerda, J. Jorge, N. Roselló, A. Giménez, A. Sanchis, J. Civera, and A. Juan, “Europarl-st: A multilingual corpus for speech translation of parliamentary debates,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 8229–8233.
  • [17] R. Cattoni, M. A. Di Gangi, L. Bentivogli, M. Negri, and M. Turchi, “Must-c: A multilingual corpus for end-to-end speech translation,” Computer Speech & Language, vol. 66, p. 101155, 2021.
  • [18] E. Salesky, M. Wiesner, J. Bremerman, R. Cattoni, M. Negri, M. Turchi, D. W. Oard, and M. Post, “The multilingual tedx corpus for speech recognition and translation,” arXiv preprint arXiv:2102.01757, 2021.
  • [19] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, “Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” arXiv preprint arXiv:2101.00390, 2021.
  • [20] Y. Jia, M. T. Ramanovich, Q. Wang, and H. Zen, “Cvss corpus and massively multilingual speech-to-speech translation,” arXiv preprint arXiv:2201.03713, 2022.
  • [21] L. Kürzinger, D. Winkelbauer, L. Li, T. Watzel, and G. Rigoll, “Ctc-segmentation of large corpora for german end-to-end speech recognition,” in International Conference on Speech and Computer.   Springer, 2020, pp. 267–278.
  • [22] D. Galvez, G. Diamos, J. Ciro, J. F. Cerón, K. Achorn, A. Gopi, D. Kanter, M. Lam, M. Mazumder, and V. J. Reddi, “The people’s speech: A large-scale diverse english speech recognition dataset for commercial usage,” arXiv preprint arXiv:2111.09344, 2021.
  • [23] J. Du, X. Na, X. Liu, and H. Bu, “Aishell-2: Transforming mandarin asr research into industrial scale,” arXiv preprint arXiv:1808.10583, 2018.
  • [24] D. Wang and X. Zhang, “Thchs-30: A free chinese speech corpus,” arXiv preprint arXiv:1512.01882, 2015.
  • [25] Aidata, “https://openslr.magicdatatech.com/62/,” openslr, 2019.
  • [26] Y. Choi and B. Lee, “Pansori: Asr corpus generation from open online video contents,” arXiv preprint arXiv:1812.09798, 2018.
  • [27] TAL, “https://ai.100tal.com/dataset,” TAL, 2021.
  • [28] C. Li, S. Deng, Y. Wang, G. Wang, Y. Gong, C. Chen, and J. Bai, “Talcs: An open-source mandarin-english code-switching corpus and a speech recognition baseline,” arXiv preprint arXiv:2206.13135, 2022.
  • [29] W. L. Lucas Jo, “https://www.openslr.org/40/,” openslr, 2019.
  • [30] J. Wirth and R. Peinl, “Asr bundestag: A large-scale political debate dataset in german,” arXiv preprint arXiv:2302.06008, 2023.
  • [31] N. Karpov, A. Denisenko, and F. Minkin, “Golos: Russian dataset for speech research,” arXiv preprint arXiv:2106.10161, 2021.
  • [32] Slizhikova, “Russian open speech to text dataset.” 2021.
  • [33] S. Takamichi, L. Kürzinger, T. Saeki, S. Shiota, and S. Watanabe, “Jtubespeech: corpus of japanese speech collected from youtube for speech recognition and speaker verification,” arXiv preprint arXiv:2112.09323, 2021.
  • [34] Y. Y. D. M. S. Fujimoto, “Reazonspeech: A free and massive corpus for japanese asr,” 2016.
  • [35] A. Ali, P. Bell, J. Glass, Y. Messaoui, H. Mubarak, S. Renals, and Y. Zhang, “The mgb-2 challenge: Arabic multi-dialect broadcast media recognition,” in 2016 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2016, pp. 279–284.
  • [36] H. Mubarak, A. Hussein, S. A. Chowdhury, and A. Ali, “Qasr: Qcri aljazeera speech resource–a large scale annotated arabic speech corpus,” arXiv preprint arXiv:2106.13000, 2021.
  • [37] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  • [38] S. Li, Y. You, X. Wang, K. Ding, and G. Wan, “Enhancing multilingual speech recognition through language prompt tuning and frame-level language adapter,” arXiv preprint arXiv:2309.09443, 2023.