Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

Less is More: Accurate Speech Recognition & Translation
without Web-Scale Data

Abstract

Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary, a multilingual ASR and speech translation model, outperforms current state-of-the-art models (Whisper, OWSM, and Seamless-M4T) on English, French, Spanish, and German, while being trained on an order of magnitude less data than these models. Three key factors enable such a data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture, (2) training on synthetic data generated with machine translation, and (3) advanced training techniques: data balancing, dynamic data blending, dynamic bucketing, and noise-robust fine-tuning. The model, weights, and training code will be open-sourced.

keywords:
speech recognition, speech translation, FastConformer, multilingual speech model
Footnote: Equal contribution.

1 Introduction

The landscape of automatic speech recognition (ASR) and automatic speech translation (AST) has been disrupted by the introduction of large-scale multi-task models.

Whisper [1] is a transformer [2] attention encoder-decoder (AED) model [3] that has demonstrated impressive ASR and AST capabilities in 96 languages. It was trained initially with 680K hours of data and later extended to 5M hours, out of which 4M were transcribed by an earlier model version.

Seamless [4] is a multimodal streaming translation model supporting around 100 languages. It uses several components pretrained on over 4M unlabeled hours of speech, which are later fine-tuned jointly on 125k hours.

OWSM [5] is the first fully open-source attempt at reproducing the Whisper model. It is trained on 180k hours of publicly available data and supports 151 languages. The latest version, OWSM v3.1, adopted the E-Branchformer architecture, achieving superior accuracy and speed [6].

Beyond impressive performance, transformer architecture, and size, what these models have in common are significant resource requirements: the amount of data and time required to train them. OWSM is the only work mentioned here that reports training time and resources used: 16 days of training on 64 NVIDIA A100 40GB GPUs. It is also trained on the least data among the models under discussion, hence we expect that Whisper and Seamless required even more resources and/or time. Another observation worth noting is that OWSM and Whisper train with a global batch size of 256 (increased to 1024 for Whisper v2 and v3), where each utterance is padded to be exactly 30s long. It was observed in [6] that such long utterances harm model convergence. We also note that this approach may lead to significant padding in mini-batches, resulting in wasted computation on non-informative frames. We present an alternative approach to sampling and batching that allows us to iterate through data twice as fast, while balancing different languages and data sources better. We further accelerate training and inference by adopting a FastConformer [7] architecture and initializing the encoder from an ASR-only pretrained checkpoint.

Besides ASR, all models under discussion are capable of AST, i.e., they can translate the recording into any of the supported languages (except for Whisper, which only translates X → En). There exists much less data for AST than for ASR, and creating such datasets usually involves translating the transcript into the target language. We demonstrate that it is possible to train a strong AST model without using existing AST data at all: instead, we adopt a text machine translation model to automatically create AST supervisions for training from ASR data.

Key contributions of this work:

  • We introduce Canary, an open-source AED model capable of ASR and AST in English, French, Spanish, and German. Canary outperforms other multi-task AED models of similar size on established benchmarks.

  • We demonstrate that it is possible to train highly accurate speech translation models using only pseudo-labeled translation data.

  • We train Canary in under 48 hours on 128 NVIDIA A100 80GB GPUs, with a total of 225K weight updates, by initializing the encoder from pre-trained weights and using dynamic bucketing batching techniques.

Using all the techniques mentioned above, we train Canary using “only” 86K hours of speech. This is an order of magnitude less than the amount of data used by the Whisper and Seamless models.

2 Methods

Model architecture. Canary uses a FastConformer encoder [7] and a Transformer decoder. FastConformer is a speech-specific modification of a transformer based on Conformer [8] that increases the downsampling factor to 8, achieving a 2.8x speedup without loss of modeling capacity [7]. We train Canary with a cross-entropy objective.

Multi-task training and prompting. To achieve multi-task support, we condition the model’s predictions using prompts. Similarly to Whisper, we adopt the following special prompt tokens: <|startoftranscript|>, <|transcribe|>, <|translate|>, <|nospeech|>, <|endoftranscript|>, and an additional special token for each supported language. Our prompt is constructed similarly to Whisper’s, except we specify both input (audio) and output (text) languages as tokens before and after <|translate|>, respectively. We also add new special tokens <|pnc|> and <|nopnc|> at the end of the prompt control sequence to select whether the model should predict punctuation and capitalization, or not. We adopt SentencePiece [9] and concatenated tokenizer [10] with a vocabulary size of 1024 for each supported language and a 32-unit sub-tokenizer for the special tokens.
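The prompt layout described above can be sketched as follows. This is an illustrative sketch rather than the released implementation: the spelling of the language tokens (`<|en|>` and so on), the helper name `build_prompt`, and the choice to use the same source/task/target layout for transcription as for translation are assumptions made for the example.

```python
from typing import List

# Assumption: language tokens are spelled like Whisper's, e.g. <|en|>.
SUPPORTED_LANGS = ("en", "de", "es", "fr")


def build_prompt(source_lang: str, target_lang: str, pnc: bool) -> List[str]:
    """Return the decoder prompt as a list of special tokens (hypothetical helper)."""
    for lang in (source_lang, target_lang):
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported language: {lang}")
    # Same input and output language means ASR; otherwise translation.
    task = "<|transcribe|>" if source_lang == target_lang else "<|translate|>"
    return [
        "<|startoftranscript|>",
        f"<|{source_lang}|>",   # input (audio) language, before the task token
        task,
        f"<|{target_lang}|>",   # output (text) language, after the task token
        "<|pnc|>" if pnc else "<|nopnc|>",  # punctuation/capitalization switch
    ]


print("".join(build_prompt("en", "de", pnc=True)))
```

Keeping the punctuation switch as the last prompt token means the same audio and language pair can be decoded with or without PnC by changing a single token.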

Dataset and language blending. Since the training data consists of multiple languages and diverse datasets, we aim to ensure consistent sampling of each throughout the training process. Failing to do so tends to result in training intervals where the model performs better on some domains/languages than others. Given $N$ datasets, we define a probability distribution $P(d)$ of choosing the next example from a specific dataset $d$. To ensure this distribution remains stationary throughout training, we use a stochastic weighted multiplexing mechanism to combine the datasets. When constructing a mini-batch of samples, for each training example we first select the dataset according to $P(d)$ and then sample an utterance from this dataset. In expectation, each mini-batch has a ratio of data coming from each source according to $P(d)$ (footnote 1: CutSet.mux method at https://lhotse.readthedocs.io/). We consider weights as “natural” when they are proportional to the cumulative duration of each dataset. We also experiment with re-weighting strategies based on stratification by language and, within each language, by dataset. Additionally, we explore the use of temperature scaling applied to these dataset weights before sampling (footnote 2: this approach is compatible with optimized dataloading techniques that rely on sequential I/O, such as webdataset [11] or Lhotse Shar [12], when each dataset is stored separately, e.g., as a separate collection of tar file shards).
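A minimal sketch of stochastic weighted multiplexing is given below. It is written from the description above and is not the Lhotse implementation; the function name `mux` mirrors the footnote but the signature, the seed handling, and the policy of dropping an exhausted source are assumptions for illustration.

```python
import random
from typing import Dict, Iterable, Iterator, Tuple


def mux(
    sources: Dict[str, Iterable],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, object]]:
    """Interleave several datasets, picking the source of each example from P(d)."""
    rng = random.Random(seed)
    iterators = {name: iter(src) for name, src in sources.items()}
    while iterators:
        names = list(iterators)
        # Draw the dataset for the next example according to its weight.
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield name, next(iterators[name])
        except StopIteration:
            # Assumption: an exhausted source is dropped. P(d) is only
            # stationary while every source still has data, e.g. when
            # each source is repeated indefinitely.
            del iterators[name]


if __name__ == "__main__":
    stream = mux({"en": range(1000), "de": range(1000)}, {"en": 0.8, "de": 0.2})
    first = [next(stream)[0] for _ in range(20)]
    print(first)
```

Because each source is consumed strictly in order, this scheme stays compatible with sequential-I/O storage formats.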

Variable utterance length and dynamic batch sizes. We address the issue of duration distribution variability across utterances, which typically range from one to forty seconds. We perform stratified sampling based on sample duration with a technique known as bucketing, where there are $M$ buckets, each containing utterances of similar duration, and any given mini-batch is sampled from just one bucket chosen randomly. Unlike most bucketing implementations that require partitioning data up-front, we leverage a dynamic bucketing technique. We estimate the optimal bucket bins (in the sense of equal bucket utilization given the data duration distribution) up-front, but the allocation of utterances into buckets is done dynamically during training with a small utterance buffer in CPU RAM. The mini-batches are sampled to satisfy a maximum total duration, resulting in a dynamic batch size (footnote 3: DynamicBucketingSampler at https://lhotse.readthedocs.io/). We further account for the quadratic sequence length complexity of the encoder by introducing a quadratic duration penalty. We find it helps equalize the GPU utilization across mini-batches from different buckets and improves the throughput. Thanks to this data-driven bucketing calibration, we typically observe only about 3% of padding in mini-batches compared to as much as 50% when using non-stratified sampling.
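The bucketing procedure can be illustrated with the simplified sketch below, which operates on bare durations instead of audio. It is not the Lhotse sampler: the function names, the bin-estimation rule (equal total duration per bucket), and the penalty form `d + d**2 / quadratic_duration` are assumptions chosen to match the description above.

```python
import random
from bisect import bisect_right
from typing import Iterable, Iterator, List, Optional


def effective_duration(d: float, quadratic_duration: Optional[float]) -> float:
    """Duration plus an assumed quadratic penalty that grows with utterance length."""
    if quadratic_duration is None:
        return d
    return d + d * d / quadratic_duration


def estimate_bins(durations: Iterable[float], num_buckets: int) -> List[float]:
    """Pick bucket boundaries so each bucket holds roughly equal total duration."""
    ds = sorted(durations)
    per_bucket = sum(ds) / num_buckets
    bins, acc = [], 0.0
    for d in ds:
        if acc > per_bucket and len(bins) < num_buckets - 1:
            bins.append(d)
            acc = 0.0
        acc += d
    return bins


def dynamic_bucketing(
    stream: Iterable[float],
    bins: List[float],
    max_duration: float,
    quadratic_duration: Optional[float] = None,
    buffer_size: int = 1000,
    seed: int = 0,
) -> Iterator[List[float]]:
    """Yield mini-batches drawn from one bucket at a time, under a duration budget."""
    rng = random.Random(seed)
    buckets: List[List[float]] = [[] for _ in range(len(bins) + 1)]
    it, buffered, exhausted = iter(stream), 0, False
    while True:
        # Refill the in-memory buffer and route utterances to buckets on the fly.
        while not exhausted and buffered < buffer_size:
            try:
                d = next(it)
            except StopIteration:
                exhausted = True
                break
            buckets[bisect_right(bins, d)].append(d)
            buffered += 1
        non_empty = [b for b in buckets if b]
        if not non_empty:
            return
        bucket = rng.choice(non_empty)
        batch, used = [], 0.0
        while bucket:
            cost = effective_duration(bucket[0], quadratic_duration)
            if batch and used + cost > max_duration:
                break
            used += cost
            batch.append(bucket.pop(0))
            buffered -= 1
        yield batch
```

Since every utterance in a batch comes from the same duration range, padding to the longest item wastes little computation, and the penalty keeps batches of long utterances smaller than a purely linear budget would.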

Improving robustness to hallucinations. In this work, we define hallucinations as producing transcriptions when input audio contains no speech. The severity of hallucinations is influenced by both the model architecture and the loss function employed. The tendency to hallucinate is a known flaw of AED models. We find that adding noise, music, and other non-speech data to the training dataset as a pseudo-language with its own weight significantly reduces hallucinations, though doesn’t eliminate them entirely.

3 Experimental setup

Training data. Canary was trained on a mixture of public and in-house datasets. Table 1 shows the composition of training data for the ASR task (81.5K hours in total). The public portion of English is composed of LibriSpeech, Fisher Corpus, Switchboard-1, WSJ-0 & WSJ-1, National Speech Corpus (Part 1, Part 6), VCTK, VoxPopuli (EN), Europarl-ASR (EN), Multilingual LibriSpeech (MLS)-EN (2k hour subset), Mozilla Common Voice (MCV)-7.0, People’s Speech (12K hour subset), and MCV-11.0 (1.5k hour subset). For German, an 800 hour subset of MCV-12.0, a 1.5K hour subset of MLS, and a 200 hour subset of VoxPopuli were gathered from public sources. For Spanish, 395 hours from MCV-12.0, 780 hours from MLS, 108 hours from VoxPopuli, and 141 hours from Fisher were collected from the public domain. Similarly, for French, a 708 hour subset of MCV-12.0, 926 hours from MLS, and 165 hours from VoxPopuli were used from the public domain. Table 1 also shows the number of hours with punctuation and capitalization (PnC) for each language. PnC was obtained from the respective dataset metadata when available (e.g., LibriSpeech). Alternatively, PnC was restored using neural models for some datasets. PnC data was further processed to remove all punctuation marks except five (’,?.!). Text normalization was applied to the ground truth. 300 hours of non-speech data (the strongly-labelled AudioSet subset from [13]) are used to improve model robustness. Data for AST was obtained solely by generating synthetic labels using Neural Machine Translation models [14, 15], without using additional datasets. 43K hours of English ASR data from Table 1 were used to generate training data for English → X (X: German, Spanish, French). All available data from Table 1 for German, Spanish, and French was used to prepare the X → English direction of the translation data. Further, a 4.8K hour English → German translation dataset from [16], which was itself pseudo-labeled, was also used, bringing the total size to 86.3K hours.

Test data. Test sets from MCV-16.1, MLS, and VoxPopuli were used to measure ASR performance across all languages. The translation capabilities of the models were examined using FLEURS, mExpresso, and CoVoST-v2. Additionally, standard English test sets such as AMI, Earnings22, Gigaspeech, LibriSpeech (test-clean & test-other), SPGI, and Tedlium were utilized to evaluate model generalization across different domains. The Music and Noise subsets (a total of 48 hours) from MUSAN [17] were used to assess robustness to hallucinations.

Table 1: Training data statistics with breakdown per language and availability of punctuation and capitalization (PnC).

| Language | Public + In-house [K hours] | PnC [K hours] | Dur. [s] [Min, Max] | # Utterances [M] |
|---|---|---|---|---|
| English | 25.5 + 37.9 | 38.5 | [1, 40] | 24 |
| German | 2.5 + 3.6 | 6.1 | [1, 20] | 2.4 |
| Spanish | 1.4 + 5.2 | 1.4 | [1, 20] | 3.8 |
| French | 1.8 + 3.3 | 1.8 | [0.5, 40] | 2.5 |
| Non-speech | 0.3 + 0.0 | NA | [0.47, 10] | 0.1 |
| Total | 31.5 + 50 | 47.8 | NA | 32.8 |

Table 2: WER (%) results on ASR benchmarks. Baseline model predictions are generated using the respective public checkpoints. Ground-truth and predictions are normalized using WhisperNormalizer [1]. Canary achieves the lowest WER in 10 out of 12 test sets across all languages.

| Model (WER ↓) | En MCV-16.1 | En MLS | En VoxPopuli | De MCV-16.1 | De MLS | De VoxPopuli | Es MCV-16.1 | Es MLS | Es VoxPopuli | Fr MCV-16.1 | Fr MLS | Fr VoxPopuli |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OWSM-v3.1 (1.02B) | 11.87 | 5.37 | 7.04 | 9.24 | 10.49 | 16.25 | 9.59 | 8.84 | 10.17 | 13.69 | 11.75 | 12.61 |
| SeamlessM4T-medium (1.2B) | 10.25 | 7.05 | 6.06 | 9.32 | 8.12 | 12.95 | 7.25 | 5.25 | 7.31 | 11.07 | 7.32 | 8.77 |
| SeamlessM4T-large-v2 (2.3B) | 7.47 | 4.14 | 4.68 | 5.82 | 6.08 | 10.68 | 4.82 | 4.14 | 6.76 | 7.75 | 5.38 | 6.82 |
| Whisper-large-v3 (1.5B) | 9.92 | 3.53 | 6.23 | 6.17 | 5.83 | 16.50 | 4.94 | 4.42 | 8.01 | 11.18 | 5.38 | 7.52 |
| Canary (1B) | 7.83 | 3.03 | 4.42 | 4.49 | 4.09 | 10.70 | 3.88 | 3.12 | 5.02 | 6.37 | 4.06 | 5.48 |

Figure 1: Word error rate on 12 test sets for the proposed model and 4 baselines. Vertical bars indicate 95% confidence intervals obtained with the bootstrap method using $10^{4}$ replications [18].
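Bootstrap confidence intervals of the kind shown in Figure 1 can be approximated with the sketch below. It assumes that per-utterance error counts and reference lengths are already available and resamples whole utterances with replacement; the function name, the percentile rule, and the default seed are illustrative choices, not details taken from the paper.

```python
import random
from typing import Sequence, Tuple


def bootstrap_wer_ci(
    errors: Sequence[int],
    ref_lens: Sequence[int],
    replications: int = 10_000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for corpus-level WER (%), resampling utterances."""
    rng = random.Random(seed)
    n = len(errors)
    stats = []
    for _ in range(replications):
        idx = [rng.randrange(n) for _ in range(n)]
        total_errors = sum(errors[i] for i in idx)
        total_words = sum(ref_lens[i] for i in idx)  # assumed to be positive
        stats.append(100.0 * total_errors / total_words)
    stats.sort()
    lo_idx = int((1.0 - confidence) / 2.0 * replications)
    hi_idx = min(replications - 1, int((1.0 + confidence) / 2.0 * replications))
    return stats[lo_idx], stats[hi_idx]


if __name__ == "__main__":
    # Toy data: 200 utterances of 10 reference words each.
    errs = [1, 0, 2, 1, 0] * 40
    lens = [10] * 200
    print(bootstrap_wer_ci(errs, lens, replications=1000))
```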

Training settings. Canary uses the XL-size FastConformer encoder from [7]. Along with the convolution sub-sampling block, it has 24 conformer layers with model dimension 1024, projection dimension 4096, convolution kernel size 9, and 8 attention heads, for a total parameter count of 0.6B. The decoder is a regular transformer decoder [2], with 24 layers of dimension 1024, projection dimension 4096, and 8 attention heads, for a total parameter count of 0.4B. The decoder uses fixed positional encoding. The encoder consumes audio in the form of 128-dim log-mel features extracted every 10 msec over a window of 25 msec (preliminary experiments didn’t show a significant difference between 80-dim and 128-dim log-mel features for ASR, but 128-dim features showed consistent improvement for AST).
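As a rough sanity check on the stated sizes, the weight matrices implied by these dimensions can be counted directly. The sketch below is a back-of-the-envelope estimate under simplifying assumptions: it ignores biases, layer norms, embeddings, positional projections, and the sub-sampling block, and it assumes the standard Conformer convolution module (pointwise expansion to twice the model dimension, depthwise convolution, pointwise projection). All function names are hypothetical.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    # Two linear maps: d_model -> d_ff and d_ff -> d_model.
    return 2 * d_model * d_ff


def attention_params(d_model: int) -> int:
    # Query, key, value, and output projections.
    return 4 * d_model * d_model


def conv_module_params(d_model: int, kernel_size: int) -> int:
    # Pointwise (d -> 2d, gated), depthwise (one kernel per channel), pointwise (d -> d).
    return 2 * d_model * d_model + d_model * kernel_size + d_model * d_model


def conformer_layer_params(d_model: int, d_ff: int, kernel_size: int) -> int:
    # Two feed-forward modules, one self-attention, one convolution module.
    return (2 * ffn_params(d_model, d_ff)
            + attention_params(d_model)
            + conv_module_params(d_model, kernel_size))


def decoder_layer_params(d_model: int, d_ff: int) -> int:
    # Self-attention, cross-attention, and one feed-forward module.
    return 2 * attention_params(d_model) + ffn_params(d_model, d_ff)


encoder = 24 * conformer_layer_params(1024, 4096, 9)
decoder = 24 * decoder_layer_params(1024, 4096)
print(f"encoder ~ {encoder / 1e9:.2f}B, decoder ~ {decoder / 1e9:.2f}B")
```

Under these assumptions the estimate comes out at roughly 0.58B for the encoder and 0.40B for the decoder, in line with the reported 0.6B and 0.4B.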
Lhotse [19] was used for dataloading with a batch duration of 360 sec per GPU, quadratic_duration of 20 sec, buffer_size of 20000, shuffle_buffer_size of 10000, and num_buckets set to 31. To balance multiple languages and datasets, training examples were sampled based on the probability distribution $p_s \sim \left(\frac{n_s}{N}\right)^{\alpha}$, where $s$ represents a stratum (e.g., a language or a dataset), $n_s$ is the number of hours for stratum $s$, $N$ is the total number of hours, and $\alpha$ is the up-sampling factor [20]. We implemented a two-level hierarchical stratification of the training corpus: initially at the language level and subsequently within each language by dataset. The final weight assigned to a dataset is the product of these two probabilities. In both stratification levels, we set $\alpha = 0.5$. Non-speech audio has been treated as a separate language for the purposes of sampling.
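The two-level weighting can be sketched as follows. The hour counts are taken from Table 1, but splitting each language into just a "public" and an "in_house" dataset is a simplification made for this example (the real corpus has many datasets per language), and the function names are hypothetical.

```python
from typing import Dict, Tuple


def temperature_weights(hours: Dict[str, float], alpha: float) -> Dict[str, float]:
    """Normalize (n_s / N) ** alpha over all strata s."""
    total = sum(hours.values())
    scaled = {k: (h / total) ** alpha for k, h in hours.items()}
    norm = sum(scaled.values())
    return {k: v / norm for k, v in scaled.items()}


def hierarchical_weights(
    corpus: Dict[str, Dict[str, float]], alpha: float = 0.5
) -> Dict[Tuple[str, str], float]:
    """Final dataset weight = P(language) * P(dataset | language)."""
    lang_hours = {lang: sum(ds.values()) for lang, ds in corpus.items()}
    p_lang = temperature_weights(lang_hours, alpha)
    weights = {}
    for lang, datasets in corpus.items():
        p_dataset = temperature_weights(datasets, alpha)
        for name, p in p_dataset.items():
            weights[(lang, name)] = p_lang[lang] * p
    return weights


# Hours in thousands, from Table 1 (illustrative two-way split per language).
corpus = {
    "en": {"public": 25.5, "in_house": 37.9},
    "de": {"public": 2.5, "in_house": 3.6},
    "es": {"public": 1.4, "in_house": 5.2},
    "fr": {"public": 1.8, "in_house": 3.3},
    "non_speech": {"public": 0.3},
}

for key, w in sorted(hierarchical_weights(corpus).items()):
    print(key, round(w, 4))
```

With $\alpha = 1$ the weights reduce to the "natural" duration-proportional ones; with $\alpha = 0.5$ the share of English drops from about 78% of the hours to roughly half of the sampled examples, which up-samples the smaller languages.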
The model was trained in 2 stages using the NVIDIA NeMo [21] framework. In stage 1, the model was trained for 150K updates. We used AdamW with a peak learning rate (LR) of 3e-4 and weight decay of 1e-3. The learning rate was warmed up over 2.5K steps and annealed as per Noam scheduling. The encoder was initialized from an XL version of [22], whose training data was a subset of Table 1. Encoder initialization helped the model converge faster and achieve better metrics overall. The decoder was randomly initialized. Stage 2 was trained for 75K updates with a peak LR of 2e-5, with the remaining hyperparameters being the same as in stage 1. The main difference between the two stages is the inclusion of the non-speech dataset from Table 1 in stage 2 (note that this 2-stage training is merely a practical convenience to allow for faster experimentation with respect to robustness and is not a necessity). In both stages, 128 A100 (80GB) GPUs were used.
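For reference, a Noam-style schedule with these hyperparameters can be written as below. This is a sketch of the common textbook form, normalized so that the learning rate equals the stated peak at the end of warm-up; the exact scaling used by the training framework may differ, and the function name is hypothetical.

```python
def noam_lr(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2500) -> float:
    """Linear warm-up to peak_lr, then inverse-square-root decay (assumed form)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    warmup = step / warmup_steps               # rises linearly to 1.0
    decay = (warmup_steps / step) ** 0.5       # falls as 1 / sqrt(step)
    return peak_lr * min(warmup, decay)


for s in (1, 1250, 2500, 10_000, 150_000):
    print(s, f"{noam_lr(s):.2e}")
```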

4 Results

We evaluate the Canary model on speech recognition (ASR) and speech-to-text translation (AST), and show the results in Table 2 and Table 4 respectively. We use Whisper, OWSM-v3.1, and SeamlessM4T as baselines by using their official checkpoints and re-running the models on the same test sets. All models use beam search decoding with beam width 5.

Table 3: Comparing Canary with SOTA models across different domains for English ASR [23]. Canary achieves the best average WER, exhibiting generalizability across domains.

| Model (WER ↓) | AMI | Earnings22 | GigaSpeech | LS Clean | LS Other | SPGISpeech | Tedlium | Avg. WER |
|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3 | 16.01 | 11.3 | 10.02 | 2.03 | 3.91 | 2.95 | 3.9 | 7.16 |
| Parakeet-RNNT-1.1B | 17.1 | 15.15 | 9.96 | 1.46 | 2.48 | 3.11 | 3.92 | 7.60 |
| Parakeet-TDT-1.1B | 15.9 | 14.65 | 9.55 | 1.39 | 2.62 | 3.42 | 3.56 | 7.30 |
| Canary (1B) | 13.53 | 12.05 | 10.07 | 1.47 | 2.86 | 2.02 | 3.53 | 6.50 |

Table 4: BLEU scores on speech translation (AST) benchmarks. Annotations from the corresponding datasets come with native punctuation and capitalization and are used without further processing/normalization. SeamlessM4T-large-v2 (2.3B) achieves the best overall BLEU scores. Canary (1B) matches or outperforms models of comparable size.

| Model (BLEU ↑) | FLEURS En → De | FLEURS En → Es | FLEURS En → Fr | mExpresso En → De | mExpresso En → Es | mExpresso En → Fr | FLEURS De → En | FLEURS Es → En | FLEURS Fr → En | CoVoST-v2 De → En | CoVoST-v2 Es → En | CoVoST-v2 Fr → En |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OWSM-v3.1 (1.02B) | 24.37 | 11.39 | 16.39 | 19.29 | 10.98 | 8.59 | 13.22 | 9.35 | 12.38 | 18.05 | 23.90 | 24.47 |
| SeamlessM4T-medium (1.2B) | 28.30 | 21.05 | 37.36 | 9.65 | 16.23 | 8.64 | 33.39 | 21.68 | 30.94 | 35.60 | 39.18 | 39.27 |
| SeamlessM4T-large-v2 (2.3B) | 33.17 | 23.72 | 43.05 | 21.48 | 34.89 | 26.04 | 37.06 | 25.41 | 33.70 | 39.96 | 42.91 | 42.12 |
| Whisper-large-v3 (1.5B) | N/A | N/A | N/A | N/A | N/A | N/A | 33.40 | 22.70 | 31.02 | 34.22 | 39.20 | 35.49 |
| Canary (1B) | 32.13 | 22.06 | 39.50 | 24.42 | 35.76 | 27.96 | 33.70 | 22.06 | 31.57 | 37.92 | 40.79 | 40.58 |

4.1 Automatic Speech Recognition (ASR)

We evaluate all models across four languages on the MCV-16.1 [24], MLS [25], and VoxPopuli [26] test sets. For the baselines, we pass the audio and the corresponding language IDs to the models’ inference APIs. For Canary, we additionally include the special token <|pnc|> to ensure all models produce text with punctuation and capitalization. We then normalize the ground-truth and predictions using the Whisper-Normalizer [1] before calculating the word error rate (WER).

Table 2 shows that the Canary model achieves the lowest WER in 10 out of 12 test sets across all languages. On average, Canary model achieves 6.20% WER on English, 6.27% WER on German, 4.09% WER on Spanish and 5.39% WER on French. In comparison, the second best model, SeamlessM4T-large-v2, with twice as many parameters as ours, achieves 5.42% WER on English, 7.53% WER on German, 5.24% WER on Spanish and 6.65% WER on French. Figure 1 shows that the 95% confidence intervals do not overlap for most test sets and systems, indicating that the WER improvements observed for Canary in Table 2 are statistically significant.

This demonstrates the advantage of the Canary model, achieving state-of-the-art multi-lingual ASR performance with fewer parameters and less training data than contemporary models.

Further, Canary achieves the best average WER of 6.5% across different test sets, highlighting its superior generalization capabilities in English ASR (Table 3).

4.2 Speech-to-text Translation (AST)

To evaluate translation of English audio into other languages (En → X), we use FLEURS [27] and mExpresso [4], whereas FLEURS [27] and CoVoST [28] are used to evaluate translation of audio from other languages into English (X → En). Annotations from all test sets have punctuation and capitalization and are used without additional processing.

From Table 4, we notice that SeamlessM4T-large-v2 achieves the highest BLEU scores on all test sets except mExpresso, which is expected given that it has the highest number of parameters and the largest amount of training data. Meanwhile, the Canary model outperforms the SeamlessM4T-medium baseline, which has a similar parameter count but was trained on more data, on all test sets. Compared with Whisper-large-v3, Canary achieves better results on CoVoST-v2 and comparable performance on FLEURS. In addition, Canary is able to translate English audio into other languages, while Whisper-large-v3 cannot. From these results, we can see that, despite being the smallest model in its class, the Canary model achieves competitive performance on speech-to-text translation.

4.3 Long-form ASR Inference

We investigate the performance of the Canary model on long-form audio by chunking long recordings into non-overlapping 30-second segments, performing inference on each segment, and then stitching the transcripts together. We use the FastConformer [29] as our baseline, and show the results on Tedlium3 [30], Earnings21 [31], and Earnings22 [32] in Table 5. WERs for the baselines are copied from the original paper [29]. We can see that, although chunking is a naive method, Canary achieves the lowest WER in transcribing long-form audio. Meanwhile, adding streaming capability to Canary remains a direction for future research.
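The chunk-and-stitch procedure can be sketched as below. The model call is represented by a placeholder callable `transcribe_fn`, and joining the per-segment transcripts with a single space is an assumption; neither is taken from the released code.

```python
from typing import Callable, List, Sequence


def chunk_audio(
    samples: Sequence[float], sample_rate: int, chunk_seconds: float = 30.0
) -> List[Sequence[float]]:
    """Split a waveform into consecutive non-overlapping chunks."""
    size = int(chunk_seconds * sample_rate)
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def transcribe_long(
    samples: Sequence[float],
    sample_rate: int,
    transcribe_fn: Callable[[Sequence[float]], str],
    chunk_seconds: float = 30.0,
) -> str:
    """Transcribe each chunk independently and stitch the texts together."""
    texts = [transcribe_fn(chunk) for chunk in chunk_audio(samples, sample_rate, chunk_seconds)]
    # Assumption: segments are joined with a space; empty outputs are skipped.
    return " ".join(t.strip() for t in texts if t.strip())


if __name__ == "__main__":
    audio = [0.0] * (16000 * 65)                 # 65 seconds at 16 kHz
    fake_model = lambda chunk: f"<{len(chunk) / 16000:.0f}s>"
    print(transcribe_long(audio, 16000, fake_model))
```

Because segment boundaries are fixed, a word that straddles a boundary may be split across two segments, which is one reason the method is described as naive.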

Table 5: WER (%) on long-form ASR inference. Both ground-truth and predictions were processed by WhisperNormalizer [1]. All models use greedy decoding. The FastConformer baseline uses a streaming mechanism, while the Canary model uses simple chunking without overlap. Canary achieves the lowest WER.

| Model (WER ↓) | Tedlium3 | Earnings21 | Earnings22 |
|---|---|---|---|
| FastConformer-CTC (FT+LCA+GT) [29] | 5.53 | 15.61 | 22.37 |
| FastConformer-RNNT (FT+LCA+GT) [29] | 4.98 | 13.84 | 19.49 |
| Canary (1B) | 4.68 | 11.34 | 14.34 |

4.4 Hallucination Robustness

The robustness of ASR models is evaluated on many axes, such as robustness to noise, music, background speech, and multiple speakers talking simultaneously. For AED models trained with the next token prediction objective, a particularly less-studied failure case is the generation of spurious transcripts when an audio sample is provided with long periods of silence (or contains no speech). In such cases, the expected output from the model should be an empty transcript. However, autoregressive AED models often hallucinate an unaligned transcript, especially when trained on web-scale data with insufficient filtering. In this work, we investigate the frequency of such hallucinations in the Canary model. To that end, we transcribe recordings without any speech.

Table 6 compares the number of hallucinated characters per minute produced by Canary with and without noise-robust training (utilizing the strongly-labeled subset of AudioSet from [13]). We also include Whisper-large-v3 as an additional baseline. As shown in the table, Canary generates 16.7% fewer hallucinated characters than Whisper-large-v3, even without noise-robust training. With noise-robust training, Canary further reduces its hallucinated characters by another 26%.
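The metric and the relative reductions quoted above can be reproduced with a few lines. In this sketch, counting every character of the stripped transcript (including internal spaces) is an assumption about how hallucinated characters are tallied, and the function names are hypothetical; the rates passed to `relative_reduction` are the values reported in Table 6.

```python
from typing import Iterable


def hallucinated_chars_per_minute(transcripts: Iterable[str], total_audio_seconds: float) -> float:
    """Characters emitted for non-speech audio, normalized by audio duration in minutes."""
    chars = sum(len(t.strip()) for t in transcripts)
    return chars / (total_audio_seconds / 60.0)


def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Rates reported in Table 6 (characters per minute).
whisper, canary_no_noise, canary = 114.8, 95.61, 70.75
print(f"{relative_reduction(whisper, canary_no_noise):.1f}%")   # Canary w/o noise-robust training vs Whisper
print(f"{relative_reduction(canary_no_noise, canary):.1f}%")    # effect of noise-robust training
```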

Table 6: Number of hallucinated characters per minute, measured on the 48-hour non-speech audio subset from MUSAN [17]. Canary hallucinates the least. (Note that there are some vocals in MUSAN audio, so the actual number of hallucinated characters may be smaller.)

| Model | # Hallucinated Chars / min (↓) |
|---|---|
| Whisper-large-v3 | 114.8 |
| Canary (1B) w/o noise-robust training | 95.61 |
| Canary (1B) | 70.75 |

5 Conclusions

In this work, we present Canary, a FastConformer-based encoder-decoder ASR and AST model for English, German, Spanish, and French, outperforming similarly sized models on established benchmarks. We demonstrate that it is possible to match or exceed the performance of contemporary AST models using solely pseudo-labeled translation data. Canary was trained using 86K hours of data—an order of magnitude less than contemporary models—while achieving comparable or superior metrics. We describe effective training techniques, including encoder initialization, data balancing, and dynamic bucketing batching, enabling us to train the model in under two days. The model and code will be open-sourced through NeMo [21].

References

  • [1] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” arXiv:2212.04356, 2022.
  • [2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017.
  • [3] D. Bahdanau, K. H. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
  • [4] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, M. Duppenthaler, P.-A. Duquenne, B. Ellis, H. Elsahar, J. Haaheim et al., “Seamless: Multilingual expressive and streaming speech translation,” arXiv:2312.05187, 2023.
  • [5] Y. Peng, J. Tian, B. Yan, D. Berrebbi, X. Chang, X. Li, J. Shi, S. Arora, W. Chen, R. Sharma et al., “Reproducing whisper-style training using an open-source toolkit and publicly available data,” in Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.
  • [6] Y. Peng, J. Tian, W. Chen, S. Arora, B. Yan, Y. Sudo, M. Shakeel, K. Choi, J. Shi, X. Chang et al., “OWSM v3.1: Better and faster open whisper-style speech models based on E-Branchformer,” arXiv:2401.16658, 2024.
  • [7] D. Rekesh, N. R. Koluguri, S. Kriman, S. Majumdar, V. Noroozi, H. Huang, O. Hrinchuk, K. Puvvada, A. Kumar, J. Balam et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in Automatic Speech Recognition and Understanding Workshop (ASRU).   IEEE, 2023, pp. 1–8.
  • [8] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Interspeech, 2020.
  • [9] T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in EMNLP: System Demonstrations, 2018.
  • [10] K. Dhawan, K. Rekesh, and B. Ginsburg, “Unified model for code-switching speech recognition and language identification based on concatenated tokenizer,” in Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching, 2023.
  • [11] “webdataset,” https://webdataset.github.io/webdataset/, accessed: 2024-03-08.
  • [12] “Lhotse shar: Storage format optimized for sequential i/o and modularity,” https://colab.research.google.com/github/lhotse-speech/lhotse/blob/master/examples/04-lhotse-shar.ipynb.
  • [13] C. D. Kim, B. Kim, H. Lee, and G. Kim, “AudioCaps: Generating captions for audios in the wild,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).   Minneapolis, Minnesota: Association for Computational Linguistics, 2019.
  • [14] NVIDIA, “Megatron multilingual model,” https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/megatronnmt_en_any_500m.
  • [15] ——, “Megatron multilingual model,” https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/megatronnmt_any_en_500m.
  • [16] O. Hrinchuk, V. Bataev, E. Bakhturina, and B. Ginsburg, “NVIDIA NeMo offline speech translation systems for IWSLT 2023,” in IWSLT, E. Salesky, M. Federico, and M. Carpuat, Eds., 2023.
  • [17] D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” arXiv:1510.08484, 2015.
  • [18] M. Bisani and H. Ney, “Bootstrap estimates for confidence intervals in asr performance evaluation,” ICASSP, 2004.
  • [19] P. Żelasko, D. Povey, J. Trmal, and S. Khudanpur, “Lhotse: a speech data representation library for the modern deep learning ecosystem,” in NeurIPS Data-Centric AI Workshop, 2021.
  • [20] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino et al., “Xls-r: Self-supervised cross-lingual speech representation learning at scale,” arXiv:2111.09296, 2021.
  • [21] E. Harper et al., “NeMo: a toolkit for Conversational AI and Large Language Models.” [Online]. Available: https://github.com/NVIDIA/NeMo
  • [22] NVIDIA, “Stt european fastconformer hybrid transducer-ctc large pnc,” https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_multilingual_fastconformer_hybrid_large_pc_blend_eu, 2023.
  • [23] V. Srivastav, S. Majumdar, N. Koluguri, A. Moumen, S. Gandhi et al., “Open automatic speech recognition leaderboard,” https://huggingface.co/spaces/hf-audio/open_asr_leaderboard, 2023.
  • [24] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Conference on Language Resources and Evaluation (LREC ), 2020.
  • [25] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “Mls: A large-scale multilingual dataset for speech research,” arXiv:2012.03411, 2020.
  • [26] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 993–1003.
  • [27] A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” in Spoken Language Technology Workshop (SLT), 2023, pp. 798–805.
  • [28] C. Wang, A. Wu, and J. Pino, “CoVoST 2: A massively multilingual speech-to-text translation corpus,” arXiv:2007.10310, 2020.
  • [29] N. R. Koluguri, S. Kriman, G. Zelenfroind, S. Majumdar, D. Rekesh, V. Noroozi, J. Balam, and B. Ginsburg, “Investigating end-to-end asr architectures for long form audio transcription,” arXiv:2309.09950, 2023.
  • [30] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Esteve, “Ted-lium 3: Twice as much data and corpus repartition for experiments on speaker adaptation,” in Speech and Computer: 20th International Conference, SPECOM, 2018.
  • [31] M. Del Rio, N. Delworth, R. Westerman, M. Huang, N. Bhandari, J. Palakapilly, Q. McNamara, J. Dong, P. Zelasko, and M. Jetté, “Earnings-21: A practical benchmark for asr in the wild,” arXiv:2104.11348, 2021.
  • [32] M. Del Rio, P. Ha, Q. McNamara, C. Miller, and S. Chandra, “Earnings-22: A practical benchmark for accents in the wild,” arXiv:2203.15591, 2022.