OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Yifan Peng
Carnegie Mellon University
[email protected]
Yui Sudo
Honda Research Institute Japan
[email protected]
Muhammad Shakeel
Honda Research Institute Japan
[email protected]
Shinji Watanabe
Carnegie Mellon University
[email protected]

Abstract

There has been an increasing interest in large speech models that can perform multiple tasks in a single model. Such models usually adopt an encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Though prior studies observed promising results of non-autoregressive models for certain tasks at small scales, it remains unclear if they can be scaled to speech-to-text generation in diverse languages and tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID). Compared to encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 24% relative improvement on ST, while it is more robust and 3 to 4 times faster for inference. OWSM-CTC also improves the long-form ASR result with 20x speed-up. We will publicly release our code, pre-trained model, and training logs to promote open science in speech foundation models.¹

¹ https://github.com/espnet/espnet


1 Introduction

Figure 1: Performance vs. speed for encoder-decoder OWSM v3.1 and our encoder-only OWSM-CTC. Panels: (a) English speech recognition; (b) X-to-En speech translation.

The great success of large language models (LLMs) (OpenAI, 2023; Touvron et al., 2023; Anil et al., 2023b) has sparked a growing interest in developing foundation models in various modalities. Recent studies have explored different approaches towards multilingual and multi-tasking speech foundation models (Radford et al., 2023; Zhang et al., 2023; Pratap et al., 2023; Rubenstein et al., 2023; Barrault et al., 2023; Peng et al., 2023e). OpenAI Whisper (Radford et al., 2023) is a series of Transformer encoder-decoder models trained on 680k hours of proprietary labeled audio. Whisper achieves strong results in multilingual automatic speech recognition (ASR), any-to-English speech translation (ST), and spoken language identification (LID). Although it shows the effectiveness of large-scale (weakly) supervised pre-training, the full development pipeline, including training data details, is not publicly accessible. Recent works have developed Open Whisper-style Speech Models (OWSM) (Peng et al., 2023e, 2024) with the aim of reproducing Whisper-style training using public data and open-source toolkits. However, Whisper and OWSM adopt the encoder-decoder architecture, which generates text tokens given speech in an autoregressive manner. They might hallucinate during inference, and the speed can be slow. Other models with decoder-only architectures, like AudioPaLM (Rubenstein et al., 2023) and VioLA (Wang et al., 2023b), could suffer from the same issues due to autoregressive decoding.

Another line of work, such as Google USM (Zhang et al., 2023) and Meta MMS (Pratap et al., 2023), uses non-autoregressive models with Connectionist Temporal Classification (CTC) (Graves et al., 2006), but these CTC-based models are designed for ASR only. Prior studies have also achieved promising results with CTC models for ST only, but they mainly focus on specific language pairs at much smaller scales (Inaguma et al., 2021; Chuang et al., 2021; Xu et al., 2023). Some of them employ additional decoders (Inaguma et al., 2021; Yan et al., 2023) or cross-attention layers (Xu et al., 2023), making the model more complicated.

A natural question now arises: Can we build a non-autoregressive encoder-only model for speech-to-text generation in diverse languages and multiple tasks like Whisper/OWSM? This research problem has become increasingly important in the era of LLMs because large-scale pre-trained speech encoders can serve as an adapter between the speech and text modalities (Gong et al., 2023; Wang et al., 2023a), providing a promising avenue towards general-purpose multi-modal foundation models (Anil et al., 2023a).

In this work, we propose OWSM-CTC, a novel encoder-only speech foundation model based on multi-task self-conditioned CTC to imitate OWSM’s multilingual ASR, any-to-any ST, and LID functionalities. Following previous encoder-decoder OWSM v3.1 models (Peng et al., 2024), we train a 1B OWSM-CTC model using 180k hours of public data covering 151 languages. Extensive evaluations show that our OWSM-CTC exhibits strong performance and efficiency. Compared to the 1B OWSM v3.1 medium model, OWSM-CTC achieves comparable performance for ASR and superior performance for various ST directions (up to 24% relative improvement) while being more robust and showing 3 to 4 times inference speed-up. OWSM-CTC also improves the WER for long-form ASR and can be 20 times faster due to batched parallel decoding. OWSM-CTC further outperforms the other baseline models on LID. Our code, pre-trained model weights, and training logs will be publicly released to facilitate the development of large speech models.

2 Related Work

2.1 Speech foundation models

Attention-based encoder-decoder. OpenAI Whisper (Radford et al., 2023) adopts the standard Transformer encoder-decoder architecture (Vaswani et al., 2017) and scales the training data to 680k hours of proprietary labeled audio.² However, the complete pipeline for model development, including training data details and training code, is not publicly available. A recent project, OWSM, aims to reproduce Whisper-style training using public data and open-source toolkits to promote transparency and open science in this field (Peng et al., 2023e). The latest OWSM v3.1 models (Peng et al., 2024) employ E-Branchformer (Kim et al., 2023) as the encoder and Transformer as the decoder, which are trained with a joint ASR CTC loss (Kim et al., 2017). Although OWSM has promising results using public corpora, it still follows the encoder-decoder architecture, which can be slow and unstable at inference time.

² Their latest large-v3 version uses 1M hours of labeled audio and 4M hours of pseudo-labeled audio.

Decoder-only. Several studies employ decoder-only models for speech-to-text tasks. AudioPaLM (Rubenstein et al., 2023) extends the textual PaLM-2 (Anil et al., 2023b) to support speech understanding and generation tasks including ASR and ST. DOTA (Gupta et al., 2024) is a decoder-only Transformer model trained on 93k hours of public English ASR data, but it does not support other languages or ST. Decoder-only models face the same slowness and robustness issues as encoder-decoder due to autoregressive decoding.

CTC or Transducer. Another line of research proposes to utilize CTC (Graves et al., 2006) or Transducer (Graves, 2012) for ASR. Google USM (Zhang et al., 2023) provides generic ASR models that are first pre-trained on 12M hours of unlabeled audio and then fine-tuned on proprietary labeled data with CTC or Transducer. Meta MMS (Pratap et al., 2023) pre-trains a wav2vec 2.0 model (Baevski et al., 2020) on massively multilingual data and then fine-tunes it with CTC on labeled ASR data covering over 1k languages. These models employ CTC only for ASR. In our OWSM-CTC, we propose a single CTC-based encoder-only model for ASR, ST, and LID. Our supported tasks are more similar to Whisper-style models.

2.2 Efficient speech models

Model compression. Various algorithms have been utilized to compress speech models, including knowledge distillation (Chang et al., 2022; Lee et al., 2022; Peng et al., 2023d; Gandhi et al., 2023), pruning (Lai et al., 2021; Peng et al., 2023a), quantization (Yeh et al., 2023; Ding et al., 2023), and dynamic module execution (Yoon et al., 2022; Peng et al., 2023c; Strimel et al., 2023). These methods are typically applied to pre-trained models and are thus orthogonal to this work. In the future, we will apply compression to further improve efficiency.

Efficient architectures. Better network architectures can also improve efficiency, including attention with linear complexity (Beltagy et al., 2020; Wang et al., 2020b; Tay et al., 2023) and sequence length reduction (Burchi and Vielzeuf, 2021; Kim et al., 2022; Nawrot et al., 2023; Rekesh et al., 2023). In this work, we do not modify the attention but use larger downsampling in the convolution module to reduce the sequence length. More details are in Appendix A.2 and B.1.

Figure 2: Architecture of our OWSM-CTC. For an input audio, it predicts a language token along with ASR or ST text tokens depending on the task specifier. An optional text prompt can be provided, which mimics Whisper.

2.3 CTC-based speech models

Non-autoregressive models have a faster inference speed than their autoregressive counterparts due to parallel decoding. They have been utilized in machine translation (Gu et al., 2018; Ghazvininejad et al., 2019; Xiao et al., 2023), ASR (Chen et al., 2019; Higuchi et al., 2020; Ng et al., 2021; Chi et al., 2021; Lee and Watanabe, 2021; Nozaki and Komatsu, 2021), and ST (Inaguma et al., 2021; Chuang et al., 2021; Xu et al., 2023).

CTC was originally proposed to label sequences without explicit segmentation (Graves et al., 2006). CTC-based ASR models learn a monotonic alignment between speech features and text tokens. With parallel greedy decoding, they are much faster than autoregressive models. However, the accuracy of CTC is generally inferior due to the conditional independence assumption between output tokens. To address this issue, Intermediate CTC (InterCTC) (Lee and Watanabe, 2021) calculates additional CTC losses using intermediate representations from the encoder. Self-conditioned CTC (Nozaki and Komatsu, 2021) further extends InterCTC by adding the predictions of intermediate CTC layers back to the input of the subsequent encoder layers. These approaches have been shown to be highly effective in speech-to-text generation tasks without a decoder (Higuchi et al., 2021).
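The parallel greedy decoding mentioned above can be illustrated with a minimal sketch. It assumes the blank symbol has index 0 and that per-frame scores are given as plain Python lists; neither detail comes from the paper.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores):
    """Collapse per-frame argmax predictions into a label sequence.

    frame_scores: list of per-frame score lists (one score per vocabulary entry).
    """
    # Every frame is decoded independently of the others, which is what
    # makes the procedure parallelizable across time.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # Merge consecutive repeats first, then drop blanks.
    collapsed = [label for label, _ in groupby(best_path)]
    return [label for label in collapsed if label != BLANK]


scores = [
    [0.1, 0.8, 0.1],    # label 1
    [0.2, 0.7, 0.1],    # label 1 again (merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # label 1 (kept, because a blank separates it)
    [0.1, 0.1, 0.8],    # label 2
]
print(ctc_greedy_decode(scores))  # [1, 1, 2]
```

Because no frame depends on a previously emitted token, the argmax can be taken for all frames (and all utterances in a batch) at once.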

Although CTC assumes a monotonic alignment between input and output, it can be used for ST with the reordering capability of self-attention (Inaguma et al., 2021; Chuang et al., 2021).

Conventional CTC models are typically designed for a specific task or language. It remains under-explored whether such approaches can be scaled to multilingual and multi-task scenarios. This work proposes a novel encoder-only speech foundation model based on multi-task self-conditioned CTC. This single model performs well in multilingual ASR, ST, and LID.

3 OWSM-CTC

3.1 Overall architecture

Figure 2 shows the architecture of OWSM-CTC. Its main component is a speech encoder, which takes speech features as input and predicts the spoken language as well as the ASR or ST hypothesis using CTC. To mimic Whisper-style models that condition text generation on an optional text prompt (Radford et al., 2023; Peng et al., 2023e, 2024), we employ a separate Transformer encoder to process the prompt and inject its output into the main model through cross-attention. The model can then attend to the text prompt when generating text.

3.2 Speech encoder

For an input waveform, we first extract log Mel filterbanks and then apply a 2D convolution module to downsample the feature sequence along the time dimension. Let $\mathbf{X}_{\text{speech}}\in\mathbb{R}^{T\times d}$ be the downsampled feature sequence of length $T$ and feature size $d$ . To specify the language and task, we prepend two special tokens to the sequence:

$$\mathbf{X}=\text{concat}(\mathbf{e}_{\text{lang}},\mathbf{e}_{\text{task}},\mathbf{X}_{\text{speech}}), \tag{1}$$

where $\text{concat}(\cdot)$ is concatenation along time and $\mathbf{e}_{\text{lang}},\mathbf{e}_{\text{task}}\in\mathbb{R}^{1\times d}$ are embeddings of special tokens <lang> and <task>, respectively. $\mathbf{X}$ now has shape $(T+2)\times d$ . If the spoken language is known, the true language token will be used as input. Otherwise, a special token <nolang> denoting “unknown language” will be used. During training, we randomly replace the true language with <nolang> according to probability 0.5 so that either can be used for inference. The task token is <asr> for speech recognition and <st_lang> for translation to a target language.
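The token selection described above can be sketched as follows. The concrete token spellings (for example `<eng>` or `<st_deu>`) and the function names are hypothetical; only `<nolang>`, `<asr>`, the `<st_lang>` pattern, and the 0.5 replacement probability come from the text.

```python
import random

NOLANG = "<nolang>"


def special_tokens(lang, task, target_lang=None, training=False, rng=random):
    """Return the (language, task) token pair prepended to the speech features."""
    known = lang is not None
    # During training, the true language is hidden half of the time so that
    # the model can run with or without a known language at inference.
    if known and training and rng.random() < 0.5:
        known = False
    lang_token = f"<{lang}>" if known else NOLANG  # hypothetical token spelling

    if task == "asr":
        task_token = "<asr>"
    elif task == "st":
        if target_lang is None:
            raise ValueError("speech translation needs a target language")
        task_token = f"<st_{target_lang}>"  # hypothetical token spelling
    else:
        raise ValueError(f"unknown task: {task}")
    return lang_token, task_token


def prepend(e_lang, e_task, x_speech):
    """Equation 1: concatenate along time, giving T + 2 rows."""
    return [e_lang, e_task] + list(x_speech)


print(special_tokens("eng", "asr"))       # ('<eng>', '<asr>')
print(special_tokens(None, "st", "deu"))  # ('<nolang>', '<st_deu>')
```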

Next, we add sinusoidal positional embeddings to $\mathbf{X}$ , and apply a stack of $N$ encoder layers:

$$\mathbf{X}^{(0)}=\mathbf{X}+\text{PosEmb}(\mathbf{X}), \tag{2}$$
$$\mathbf{X}^{(l)}=\text{SpeechEnc}^{(l)}(\mathbf{X}^{(l-1)}), \tag{3}$$

where $l$ is a layer index from 1 to $N$ , $\text{PosEmb}(\cdot)$ generates positional embeddings, and $\text{SpeechEnc}^{(l)}(\cdot)$ is the $l$ -th encoder layer. The encoder is E-Branchformer (Kim et al., 2023), an enhanced version of Branchformer (Peng et al., 2022), which shows excellent performance across a wide range of benchmarks (Peng et al., 2023b).

We compute the CTC loss using the final encoder output $\mathbf{X}^{(N)}$ and an augmented reference $\mathbf{y}_{\text{task}}$. To create this reference, we simply prepend <lang> and <task> to the original groundtruth text of the desired task. Hence, the model will learn to predict the language token in addition to ASR or ST text tokens. This CTC loss is denoted as follows:

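$$\mathcal{L}^{(N)}=-\log P_{\text{CTC}}(\mathbf{y}_{\text{task}}\mid\text{softmax}(\mathbf{X}^{(N)}\mathbf{W}_{1})), \tag{4}$$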

where $\mathbf{W}_{1}\in\mathbb{R}^{d\times V}$ is a linear layer and $V$ is the size of the CTC vocabulary.
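To make the CTC loss of Equation 4 concrete, here is a minimal sketch of the CTC negative log-likelihood computed with the standard forward algorithm. It works in the probability domain on plain lists for readability; a practical implementation would use log-space arithmetic and a library routine. The blank index of 0 is an assumption.

```python
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """Return -log P_CTC(target | probs) using the forward algorithm.

    probs:  T x V list of per-frame posteriors (each row sums to 1).
    target: list of label indices, without blanks.
    """
    # Extended sequence: blanks before, between, and after the labels.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_states, num_frames = len(ext), len(probs)

    # alpha[s] = total probability of all partial paths ending in state s.
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid paths end in the last label or the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)
    return -math.log(likelihood)


uniform = [[0.5, 0.5], [0.5, 0.5]]
# Paths (1,1), (blank,1), (1,blank) each have probability 0.25.
print(ctc_neg_log_likelihood(uniform, [1]))  # -log(0.75)
```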

As discussed in Section 2.3, we apply self-conditioned CTC (Nozaki and Komatsu, 2021) at intermediate layers $\mathcal{S}\subseteq\{1,\ldots,N-1\}$ to alleviate the conditional independence assumption of CTC. For any layer $s\in\mathcal{S}$ , Equation 3 is replaced by the following operations:

$$\mathbf{A}^{(s)}=\text{SpeechEnc}^{(s)}(\mathbf{X}^{(s-1)}), \tag{5}$$
$$\mathbf{B}^{(s)}=\text{softmax}(\mathbf{A}^{(s)}\mathbf{W}_{1}), \tag{6}$$
$$\mathbf{X}^{(s)}=\mathbf{A}^{(s)}+\mathbf{B}^{(s)}\mathbf{W}_{2}, \tag{7}$$

where $\mathbf{W}_{2}\in\mathbb{R}^{V\times d}$ is a linear layer. The intermediate CTC loss at layer $s$ is defined as follows:

$$\mathcal{L}^{(s)}=-\log P_{\text{CTC}}(\mathbf{y}^{(s)}\mid\mathbf{B}^{(s)}), \tag{8}$$

where $\mathbf{y}^{(s)}$ is the augmented reference at layer $s$. Similar to $\mathbf{y}_{\text{task}}$ in Equation 4, we prepend the language and task tokens to the original groundtruth text. Note that the choice of the reference text depends on the task. If the task for the current input is ASR, we simply use the ASR transcript to create $\mathbf{y}^{(s)}$ for all $s$, which is consistent with conventional ASR models. However, if the task is ST, we empirically find that the model cannot converge if we use the translated text as the reference at all intermediate layers $\mathcal{S}$ (see Appendix B.2 for discussions). Therefore, as shown in Figure 2, we utilize the ASR transcript at the first $N_{\text{ASR}}$ layers and the ST text at the remaining $N_{\text{ST}}$ layers, where $N_{\text{ASR}}+N_{\text{ST}}=|\mathcal{S}|\leq N-1$. This design mimics a cascaded system that first performs ASR and then ST, but our entire model is optimized jointly and trained from scratch. In other words, the first $N_{\text{ASR}}$ CTC layers always perform ASR regardless of the task token (named “ASR-only CTC”), whereas the other CTC layers are multi-tasking: they can perform ASR or ST according to the task token (named “task-specific” or “task-dependent” CTC).
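The reference selection rule can be sketched as below. The layer indices are taken from the configuration reported in Section 4.1 (intermediate CTC at layers 6, 12, 15, and 21, of which the first three are ASR-only); the function name and signature are hypothetical, and prepending the language and task tokens is left out for brevity.

```python
ASR_ONLY_CTC_LAYERS = (6, 12, 15)  # first N_ASR intermediate CTC layers
TASK_SPECIFIC_CTC_LAYERS = (21,)   # remaining N_ST intermediate CTC layers
FINAL_LAYER = 27                   # the final CTC layer is also task-specific


def reference_text(layer, task, asr_text, st_text=None):
    """Pick the groundtruth text used by the CTC loss at a given layer."""
    if task == "asr":
        # ASR samples use the transcript at every CTC layer.
        return asr_text
    if layer in ASR_ONLY_CTC_LAYERS:
        # ST samples still use the transcript at the ASR-only layers,
        # mimicking the first stage of a cascaded system.
        return asr_text
    # Task-specific layers (and the final layer) use the translation.
    return st_text


for layer in ASR_ONLY_CTC_LAYERS + TASK_SPECIFIC_CTC_LAYERS + (FINAL_LAYER,):
    print(layer, reference_text(layer, "st", "hello world", "hallo welt"))
```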

The overall training loss is an average of the loss terms defined in Equation 4 and Equation 8:

$$\mathcal{L}_{\text{total}}=\frac{1}{1+|\mathcal{S}|}\left(\mathcal{L}^{(N)}+\sum_{s\in\mathcal{S}}\mathcal{L}^{(s)}\right). \tag{9}$$
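The self-conditioning update of Equations 6 and 7 and the loss average of Equation 9 can be sketched with plain Python lists. This is only an illustration of the arithmetic: the encoder layer of Equation 5 is assumed to have been applied already, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Numerically stable softmax over each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_condition(a, w1, w2):
    """Apply Equations 6 and 7 to the layer output A^(s).

    a:  (T+2) x d layer output, w1: d x V, w2: V x d.
    Returns (X^(s), B^(s)).
    """
    b = softmax_rows(matmul(a, w1))  # Eq. 6: per-frame CTC posteriors
    feedback = matmul(b, w2)         # project posteriors back to d dimensions
    x = [[ai + fi for ai, fi in zip(a_row, f_row)]  # Eq. 7: residual addition
         for a_row, f_row in zip(a, feedback)]
    return x, b


def total_loss(final_loss, intermediate_losses):
    """Equation 9: average of the final and all intermediate CTC losses."""
    return (final_loss + sum(intermediate_losses)) / (1 + len(intermediate_losses))


print(total_loss(1.0, [2.0, 3.0]))  # 2.0
```

The posteriors `b` returned by `self_condition` are the same quantities that enter the intermediate loss in Equation 8, so no extra projection is needed to compute that loss.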

3.3 Prompt encoder

Whisper-style models generate text conditioned on an optional text prompt (Radford et al., 2023; Peng et al., 2023e, 2024). During training, this prompt is simply the previous sentence in the same audio recording. During inference, it can be provided by the user to potentially adjust the output. For encoder-decoder models like Whisper, the text prompt is a prefix to the autoregressive decoder. For our encoder-only model, we leverage a separate Transformer encoder to process the prompt and inject its output into the speech encoder through cross-attention. If no prompt is provided, a special token <na> is used. Let $\mathbf{X}_{\text{prompt}}\in\mathbb{R}^{T^{\prime}\times d^{\prime}}$ be the output of the prompt encoder. We insert a cross-attention layer at a subset of layers $\mathcal{T}\subseteq\{1,\ldots,N\}$ of the speech encoder. For any $t\in\mathcal{T}$, the original $\text{SpeechEnc}^{(t)}(\cdot)$ in Equation 3 or Equation 5 becomes $\text{SpeechEncCA}^{(t)}(\cdot,\cdot)$:

$$\mathbf{D}^{(t)}=\text{SpeechEnc}^{(t)}(\mathbf{X}^{(t-1)}), \tag{10}$$
$$\text{SpeechEncCA}^{(t)}(\mathbf{X}^{(t-1)},\mathbf{X}_{\text{prompt}})=\mathbf{D}^{(t)}+\text{CrossAtt}(\mathbf{D}^{(t)},\mathbf{X}_{\text{prompt}},\mathbf{X}_{\text{prompt}}), \tag{11}$$

where $\text{CrossAtt}(\cdot,\cdot,\cdot)$ is a cross-attention layer with three arguments: query, key, and value.
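Equation 11 can be illustrated with a single-head scaled dot-product cross-attention on plain lists. The projection matrices `wq`, `wk`, and `wv` are assumptions introduced here so that the prompt width $d^{\prime}$ can differ from the speech width $d$; the paper does not spell out the internals of the cross-attention layer, which is most likely multi-head in practice.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def speech_enc_ca(d_out, x_prompt, wq, wk, wv):
    """Equation 11 given D^(t): add cross-attention over the prompt.

    d_out:    (T+2) x d output of the speech encoder layer (Equation 10).
    x_prompt: T' x d' output of the prompt encoder.
    wq: d x k, wk: d' x k, wv: d' x d (assumed projections).
    """
    q = matmul(d_out, wq)     # queries come from the speech frames
    k = matmul(x_prompt, wk)  # keys come from the prompt
    v = matmul(x_prompt, wv)  # values come from the prompt
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)  # each frame attends over prompt positions
    attended = matmul(weights, v)
    # Residual connection: D^(t) + CrossAtt(D^(t), X_prompt, X_prompt).
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(d_out, attended)]


# With zero query weights the attention is uniform over the two prompt positions.
out = speech_enc_ca([[1.0, 1.0]], [[2.0], [4.0]], [[0.0], [0.0]], [[1.0]], [[1.0, 0.0]])
print(out)  # [[4.0, 1.0]]
```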

Our training data is a mixture of public ASR and ST datasets. Some of them provide unsegmented long audio, while the others only release segmented short audio. At training time, if a sample does not have a previous sentence, we use <na>. Otherwise, we use either <na> or the previous sentence as the prompt, each with probability 0.5. Section 4.6 shows that OWSM-CTC can leverage the prompt’s information when necessary.
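A minimal sketch of this prompt sampling rule is shown below; the function name is hypothetical, and only the `<na>` token and the 0.5 probability come from the text.

```python
import random

NA = "<na>"


def training_prompt(previous_sentence, rng=random):
    """Choose the text prompt for one training sample."""
    if previous_sentence is None:
        # Segmented short audio has no preceding context.
        return NA
    # Otherwise the previous sentence is kept half of the time.
    return previous_sentence if rng.random() < 0.5 else NA


print(training_prompt(None))  # <na>
```

Dropping the prompt half of the time keeps the model usable both with and without context at inference.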

Whisper (encoder-decoder) (Radford et al., 2023)
	Params	Time shift	Training data	GPU hours
base	74M	20ms	680k hours	unknown
small	244M	20ms	680k hours	unknown
medium	769M	20ms	680k hours	unknown
large-v2	1550M	20ms	680k hours	unknown
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base	101M	40ms	180k hours	2.3k
medium	1.02B	40ms	180k hours	24.6k
OWSM-CTC (ours)
medium	1.01B	80ms	180k hours	19.2k

Table 1: Summary of model size, training data, and training cost measured on an NVIDIA A100 GPU (40GB).

4 Experiments

4.1 Experimental setups

Table 1 is a brief summary of model size, training data, and training cost.

Data format. Our training data is prepared using scripts publicly released by OWSM v3.1 (Peng et al., 2024). It is a mixture of more than 25 public ASR and ST corpora covering 151 languages and various translation directions. The total audio duration is 180k hours. To create long-form data, consecutive utterances from the same audio recording are concatenated to a duration of no more than 30 seconds. The input audio to the model is always padded to a fixed length of 30 seconds. Appendix A.1 and Table 11 present the training data statistics. The original Whisper-style data contains the start and end timestamps for each utterance. These timestamp tokens are predicted along with normal text tokens during the autoregressive decoding. In OWSM-CTC, we do not include any explicit timestamps since the time-aligned hypothesis can be obtained by forced alignment if desired.
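The long-form data construction can be pictured with the following sketch, which greedily merges consecutive utterances while the merged span stays within 30 seconds. The greedy policy and the `(start, end, text)` tuple format are assumptions for illustration; the actual OWSM v3.1 data preparation scripts may differ in their details.

```python
MAX_SECONDS = 30.0


def build_long_form(utterances):
    """Merge consecutive utterances from one recording into longer samples.

    utterances: list of (start, end, text) tuples sorted by start time.
    A merged sample spans from its first start to its last end, so any
    silence between utterances counts towards the 30-second limit.
    An utterance that is itself longer than the limit is passed through as is.
    """
    groups, current = [], []
    for utt in utterances:
        if current and utt[1] - current[0][0] > MAX_SECONDS:
            groups.append(current)
            current = []
        current.append(utt)
    if current:
        groups.append(current)
    return [(g[0][0], g[-1][1], " ".join(text for _, _, text in g)) for g in groups]


utts = [(0.0, 10.0, "a"), (10.0, 25.0, "b"), (25.0, 40.0, "c"), (40.0, 50.0, "d")]
print(build_long_form(utts))  # [(0.0, 25.0, 'a b'), (25.0, 50.0, 'c d')]
```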

Model architecture. Our speech encoder is a 27-layer E-Branchformer with a hidden size of 1024 and 16 attention heads. Four intermediate layers (6, 12, 15, and 21) are used for self-conditioned CTC. The first three are ASR-only, while the remaining CTC layers (layer 21 and the final layer) are task-specific. The prompt encoder is a 4-layer Transformer with a hidden size of 512 and 8 attention heads. Its output is injected into the speech encoder through cross-attention at every third layer. The total model size is 1.01B, which matches the size of the encoder-decoder OWSM v3.1 medium (1.02B). More details about the architecture are in Appendix A.2 (see Table 12).

Implementation. We implement OWSM-CTC in ESPnet (Watanabe et al., 2018) based on PyTorch (Paszke et al., 2019). FlashAttention (Dao et al., 2022) is used to improve training efficiency, but it is not used for inference. The batch size per GPU is 4, and 64 NVIDIA A100 GPUs (40GB) are used with distributed data parallel. The total training time is approximately 300 hours. For optimization, we employ the Adam optimizer (Kingma and Ba, 2015) with the piece-wise linear learning rate schedule (Peng et al., 2024). The peak learning rate is 2e-4. Other training hyperparameters can be found in Appendix A.3 (see Table 13).
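As a quick consistency check (simple arithmetic on the numbers above, not an additional measurement), the GPU count and the wall-clock training time reproduce the training cost listed for OWSM-CTC in Table 1:

```latex
$$64\ \text{GPUs}\times 300\ \text{hours}=19{,}200\ \text{GPU hours}\approx 19.2\text{k GPU hours}.$$
```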

Evaluation. We fairly compare our encoder-only OWSM-CTC with the previously released encoder-decoder OWSM v3.1 models (Peng et al., 2024) since they are trained on the same data. We also show the results of Whisper under the same decoding setup for reference, but we note that they are not comparable with ours due to completely different training data. By default, short-form audio without any text prompt is used, but we also evaluate the long-form ASR performance in Section 4.5 and investigate the effect of text prompts in Section 4.6.

Whisper (encoder-decoder) (Radford et al., 2023)
	Accuracy % ( $\uparrow$ )
base	47.6
small	53.1
medium	54.8
OWSM v3 (encoder-decoder) (Peng et al., 2023e)
medium	81.4
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base	41.9
medium	75.6
OWSM-CTC (ours)
medium	87.6

Table 2: Spoken LID results on the FLEURS test set.

4.2 Language identification

Table 2 presents the LID results on the FLEURS test set (Conneau et al., 2023). Our OWSM-CTC achieves a top-1 accuracy of 87.6%, outperforming the other encoder-decoder models by a large margin. This is likely because spoken LID requires a powerful encoder to extract useful information from the input audio. Our encoder-only model is especially suitable for this type of task.

	CommonVoice en	FLEURS en	LibriSpeech test-clean	LibriSpeech test-other	MLS en	Switchboard eval2000	TEDLIUM	VoxPopuli en	WSJ eval92	Average WER ( $\downarrow$ )	Speed-up ( $\uparrow$ )
Whisper (encoder-decoder) (Radford et al., 2023)
base	25.2	12.4	5.1	12.0	13.4	25.7	6.3	10.2	5.0	12.8	2.40x
small	15.7	9.6	3.3	7.7	9.1	22.2	4.6	8.5	4.3	9.4	1.46x
medium	11.9	6.4	2.8	6.5	10.2	19.4	5.1	7.6	2.9	8.1	0.76x
large-v2	10.5	6.0	4.1	6.1	7.7	24.0	6.0	7.1	3.3	8.3	0.55x
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base	21.5	14.8	3.6	9.1	12.0	22.9	7.8	12.0	5.3	12.1	2.97x
medium	12.6	9.0	2.4	5.0	7.1	16.3	5.1	8.4	3.5	7.7	1.00x
+ beam 5	11.7	8.5	2.7	5.3	6.6	15.5	5.1	8.5	3.4	7.5	0.06x
OWSM-CTC (ours)
medium	12.1	9.9	2.4	5.2	7.3	16.9	4.9	8.6	4.2	7.9	3.63x

Table 3: WER % ($\downarrow$) of English ASR. Speed-up ($\uparrow$) is based on average decoding time. Whisper is trained on 438k hours of English audio, whereas OWSM v3.1 and our OWSM-CTC are trained on only 73k hours. Results of Whisper large-v2 and OWSM v3.1 medium with beam search are shown in gray, which are not comparable with the others due to different model sizes or decoding configurations. Bold: the best result. Underlined: OWSM-CTC outperforms OWSM v3.1 medium.

	MLS es	MLS fr	MLS de	MLS nl	MLS it	MLS pt	MLS pl	AISHELL-1 (zh)	KsponSpeech clean (ko)	KsponSpeech other (ko)	ReazonSpeech (ja)	Average Error Rate ( $\downarrow$ )
data size	11.1	9.8	13.3	2.1	2.6	8.6	4.3	23.4	8.0	8.0	7.1
Whisper (encoder-decoder) (Radford et al., 2023)
base	14.5	25.2	19.9	30.9	32.9	23.5	25.2	39.1	27.0	22.9	54.1	28.7
small	9.1	13.6	11.5	18.2	21.3	13.8	12.5	25.1	24.0	15.4	32.5	17.9
medium	6.1	9.7	8.1	12.2	15.6	8.9	6.8	15.7	17.6	12.8	25.3	12.6
large-v2	4.8	7.0	6.3	9.7	13.2	6.6	5.5	18.3	20.0	13.1	26.8	11.9
data size	2.0	2.5	3.7	1.7	0.7	0.3	0.3	16.3	1.0	1.0	18.9
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base	18.5	24.2	18.7	28.6	33.7	44.9	49.7	12.2	23.8	26.1	11.2	26.5
medium	9.0	12.1	10.8	18.1	20.2	21.6	25.2	6.4	16.7	18.9	7.9	15.2
+ beam 5	8.6	11.2	10.2	17.2	19.1	19.4	23.4	5.9	15.0	17.0	7.8	14.1
OWSM-CTC (ours)
medium	10.3	12.9	11.9	20.4	22.1	23.5	31.6	6.4	14.8	16.5	8.1	16.2

Table 4: Multilingual ASR results. CER % ($\downarrow$) is shown for Chinese (zh), Korean (ko) and Japanese (ja), while WER % ($\downarrow$) is shown for the others. Data sizes are in thousand hours. Results of Whisper large-v2 and OWSM v3.1 medium with beam search are shown in gray, which are not comparable with the others. Bold: the best result. Underlined: OWSM-CTC outperforms OWSM v3.1 medium.

4.3 Speech recognition

Table 3 presents word error rates (WERs) on nine English ASR test sets. Following Peng et al. (2023e, 2024), we leverage greedy decoding and apply the Whisper English text normalizer before scoring.³ We record the average decoding time across all English test sets on an NVIDIA A40 GPU and calculate the relative speed-up. Results show that our non-autoregressive OWSM-CTC generally has comparable WERs with the autoregressive OWSM v3.1 medium (average: 7.9 vs. 7.7), both of which have 1B parameters. However, OWSM-CTC achieves 3.63x speed-up due to parallel decoding. Notably, OWSM-CTC is even faster than OWSM v3.1 base, which has only 100M parameters, and our WERs are much lower (average: 7.9 vs. 12.1). Compared to Whisper models trained on significantly more data, our OWSM-CTC is still competitive in many cases, and our inference is much faster. These results demonstrate that OWSM-CTC achieves an excellent trade-off between recognition accuracy and inference efficiency.

³ We also report the results of Whisper large-v2 and OWSM v3.1 medium with beam search in gray for reference, but they are not comparable with the others due to different model sizes or decoding configurations. This applies to other tables as well.

Table 4 shows the results of multilingual ASR. We perform greedy decoding and apply the Whisper basic text normalizer before scoring. Our OWSM-CTC is slightly worse than OWSM v3.1 in terms of the average WER/CER (16.2 vs. 15.2). For European languages in MLS (Pratap et al., 2020), OWSM-CTC generally falls behind. But for East Asian languages like Chinese, Japanese, and Korean, OWSM-CTC is on par with or better than OWSM v3.1 medium. This difference might be related to the training data size and tokenization.

Src Lang.	de	es	fr	ca	Ave. ( $\uparrow$ )	Speed-up ( $\uparrow$ )
data size	4.3	6.7	4.5	0.2
Whisper (encoder-decoder) (Radford et al., 2023)
base	11.0	18.9	13.2	9.9	13.3	1.84x
small	23.9	31.8	26.1	21.4	25.8	1.54x
medium	32.0	37.3	33.4	28.8	32.9	0.84x
large-v2	35.2	39.7	35.7	31.2	35.5	0.48x
data size	0.2	0.1	0.3	0.1
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base	7.1	10.3	11.5	9.4	9.6	2.78x
medium	16.7	22.3	22.8	18.8	20.2	1.00x
+ beam 5	18.2	24.5	24.4	21.1	22.1	0.05x
OWSM-CTC (ours)
medium	20.7	27.9	27.5	24.2	25.1	3.35x

Table 5: BLEU ($\uparrow$) of X-to-En ST on CoVoST-2. Data sizes are in thousand hours. Results of Whisper large-v2 and OWSM v3.1 medium with beam search are shown in gray, which are not comparable with the others. Bold: the best result. Underlined: OWSM-CTC outperforms OWSM v3.1 medium.

Tgt Lang.	de	ca	zh	fa	et	mn	tr	ar	sv	lv	sl	ta	ja	id	cy	Ave. ( $\uparrow$ )	Speed-up ( $\uparrow$ )
data size	14.0	0.4	13.7	0.8	0.4	0.4	0.9	0.9	0.4	0.4	0.4	0.4	1.0	0.4	0.4	-	-
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base	15.8	8.3	13.0	3.3	3.1	1.6	2.0	1.7	8.7	2.3	1.3	0.0	10.6	6.1	5.0	5.5	2.39x
medium	26.3	20.4	29.7	10.2	9.6	5.8	7.8	7.2	20.8	8.4	11.0	0.1	21.1	17.2	16.3	14.1	1.00x
+ beam 5	27.3	22.5	31.3	11.1	11.1	6.9	9.1	8.4	22.3	9.9	12.7	0.1	22.3	19.7	17.9	15.5	0.05x
OWSM-CTC (ours)
medium	26.7	24.0	32.9	9.9	11.4	6.2	7.9	8.3	24.5	10.0	14.2	0.1	20.4	22.6	20.6	16.0	4.20x
p-value	0.006	0.001	0.001	0.001	0.001	0.001	0.145	0.001	0.001	0.001	0.001	0.031	0.001	0.001	0.001	-	-

Table 6: BLEU ($\uparrow$) of En-to-X ST on CoVoST-2. Data sizes are in thousand hours. Note that Whisper does not support En-to-X translation. The p-values are computed by comparing OWSM-CTC against OWSM v3.1 medium using the Paired Significance Test in SacreBLEU (Post, 2018). Results of OWSM v3.1 medium with beam search are shown in gray, which are not comparable with the others.

4.4 Speech translation

We evaluate ST on CoVoST-2 test sets (Wang et al., 2020a). By default, we perform greedy decoding and calculate BLEU scores in true case with punctuation.⁴ For X-to-En translation, we follow OWSM v3.1 (Peng et al., 2024) to report results of directions where the training data size is over 100 hours. For the other low-resource directions, neither OWSM v3.1 nor our OWSM-CTC works in general. For En-to-X translation, we report all 15 directions. We calculate the speed-up based on the average decoding time on an NVIDIA A40 GPU.

⁴ Results in lowercase without punctuation can be found in Appendix C, which are consistent with previous OWSM work (Peng et al., 2024).

Table 5 shows the X-to-En results. Notably, our encoder-only OWSM-CTC consistently outperforms the encoder-decoder OWSM v3.1 by a large margin. The average BLEU score is improved from 20.2 to 25.1 (a 24% relative improvement). We also achieve a 3.35x speed-up for inference.

Table 6 presents En-to-X results. Whisper does not support these directions. Our OWSM-CTC outperforms OWSM v3.1 medium in 12 of 15 translation directions, and most of the differences are statistically significant. The average BLEU is improved from 14.1 to 16.0 (a 13% relative improvement), and the inference speed-up is 4.20x.
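The relative improvements quoted for Tables 5 and 6 follow directly from the average BLEU scores. The short sketch below redoes that arithmetic; the helper name is hypothetical.

```python
def relative_improvement(baseline, ours):
    """Relative gain of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Average BLEU from Table 5 (X-to-En) and Table 6 (En-to-X).
print(round(relative_improvement(20.2, 25.1)))  # 24
print(round(relative_improvement(14.1, 16.0)))  # 13
```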

We have the following observations from the ST results: (1) Our non-autoregressive OWSM-CTC generally achieves 3 to 4 times speed-up compared to the encoder-decoder baseline, which is consistent with ASR. (2) OWSM-CTC also improves the ST performance, sometimes by a large margin. One reason is that the autoregressive model suffers from hallucination and error propagation, while the non-autoregressive model is more stable. (3) The BLEU improvement of X-to-En is larger than that of En-to-X, likely because: (i) the OWSM training set contains a large amount of English ASR data, so OWSM-CTC might acquire a strong capability of generating English text; (ii) X-to-En has less training data than En-to-X, and the encoder-decoder model may need a sufficient amount of training data to achieve good performance for translation.

Our findings reveal that large-scale CTC-based models are also promising for ST in various language pairs, which is consistent with prior investigations at smaller scales (Yan et al., 2023).

Whisper (encoder-decoder) (Radford et al., 2023)
	Context Length	WER % ( $\downarrow$ )	Speed-up ( $\uparrow$ )
base	-	5.3	1.40x
small	-	4.4	1.62x
medium	-	3.8	0.86x
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base	-	9.6	1.40x
medium	-	5.7	1.00x
OWSM-CTC (ours)
medium	2s	5.4	22.40x
	4s	5.2	19.35x
	6s	5.2	16.07x
	8s	5.2	12.09x

Table 7: Long-form ASR results on the TEDLIUM (Hernandez et al., 2018) test set which consists of 11 audio recordings ranging from 6 to 27 minutes. Bold: the best result. Underlined: OWSM-CTC outperforms OWSM v3.1 medium.

4.5 Long-form speech recognition

For long-form ASR, a model takes as input an unsegmented audio recording of arbitrary length and generates the entire transcription without explicit voice activity detection. Whisper and encoder-decoder OWSM can predict start and end timestamps of each utterance within a fixed-length segment. Those timestamps are used to shift the recognition window for chunk-wise long-form ASR. However, this chunk-wise recognition is a sequential process because the location of the next chunk depends on the predicted timestamp in the current chunk.⁵ By contrast, our OWSM-CTC performs chunk-wise recognition in a fully parallel manner. We first split the entire audio into overlapped chunks of 30s, where the overlapped region serves as the left and right context.⁶ We then perform CTC greedy decoding on batched chunks. The batch size is 32 on a single NVIDIA A40 GPU (48GB). Table 7 shows the WER and speed-up with different context lengths. Our OWSM-CTC achieves lower WERs than the encoder-decoder OWSM v3.1, while being approximately 20 times faster due to the batched parallel decoding. OWSM-CTC is also robust to different context lengths. These observations indicate that CTC-based non-autoregressive models perform very well for long-form ASR, which is consistent with prior findings (Koluguri et al., 2023).

⁵ The decoding process might be parallelized if token-level timestamps are available. However, it remains an open problem to derive accurate token-level timestamps from an attention-based encoder-decoder model without extra training.

⁶ We follow this tutorial for long-form ASR with CTC: https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Streaming_ASR.ipynb
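The chunking step can be sketched as follows. The sketch assumes that each 30-second chunk consists of a central region whose output is kept, plus left and right context of equal length whose output is discarded, and that chunks at the recording boundaries are simply shorter before padding. These details are assumptions for illustration and are not taken from the paper or the referenced tutorial.

```python
CHUNK_SECONDS = 30.0
BATCH_SIZE = 32


def plan_chunks(total_seconds, context_seconds):
    """Return (chunk_start, chunk_end, keep_start, keep_end) tuples in seconds.

    Only the frames between keep_start and keep_end contribute to the
    final transcript; the rest of each chunk is context.
    """
    hop = CHUNK_SECONDS - 2 * context_seconds
    if hop <= 0:
        raise ValueError("context is too long for the chunk size")
    chunks = []
    keep_start = 0.0
    while keep_start < total_seconds:
        keep_end = min(keep_start + hop, total_seconds)
        chunk_start = max(0.0, keep_start - context_seconds)
        chunk_end = min(total_seconds, keep_start + hop + context_seconds)
        chunks.append((chunk_start, chunk_end, keep_start, keep_end))
        keep_start += hop
    return chunks


def batches(chunks, batch_size=BATCH_SIZE):
    """Group chunks so that each batch can be decoded in one parallel pass."""
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]


# A 27-minute recording needs more chunks as the context grows.
print(len(plan_chunks(27 * 60.0, 2.0)))  # 63
print(len(plan_chunks(27 * 60.0, 8.0)))  # 116
```

Under these assumptions, a longer context means a shorter hop and therefore more chunks per recording, which is in line with the lower speed-up at longer context lengths in Table 7.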

	GigaSpeech	LS-clean	LS-other	SWBD	TEDLIUM	AISHELL
w/o prev	11.80	2.42	5.22	16.92	4.95	6.37
w/ prev	11.23	2.38	5.10	16.70	4.55	6.25
p-value	<0.001	0.19	0.007	<0.001	<0.001	<0.001

Table 8: Using the previous sentence as a text prompt improves the ASR WER/CER of OWSM-CTC.

4.6 Effect of text prompt

As described in Figure 2 and Section 3.3, OWSM-CTC can take an additional text prompt as input, which might change the output. During training, the prompt is either a special token <na> or the previous sentence in the same audio, each chosen with probability 0.5, following the setup of Whisper and OWSM. To verify that OWSM-CTC can utilize information from the prompt when necessary, we perform greedy decoding on several test sets with the previous sentence in the dataset as a prompt. As shown in Table 8, using the previous sentence reduces the error rates on all six test sets, and the reduction is statistically significant (p < 0.01) on all of them except LS-clean (p = 0.19). The p-values are computed using the Matched Pair Sentence Segment method in SCTK (https://github.com/usnistgov/SCTK). Appendix D provides an example where the previous sentence also affects the output text style.
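The prompt selection used during training can be illustrated with a short sketch. The function name and the fallback to <na> when no previous sentence exists are assumptions made for illustration; only the <na> token and the probability of 0.5 come from the text.

```python
import random
from typing import Optional

NA_TOKEN = "<na>"  # special "no prompt" token named in the text


def choose_prompt(
    prev_sentence: Optional[str],
    rng: random.Random,
    p_prev: float = 0.5,
) -> str:
    """Pick the training prompt for one utterance.

    With probability p_prev, use the previous sentence from the same audio;
    otherwise use <na>. Falling back to <na> when there is no previous
    sentence is an assumption of this sketch.
    """
    if prev_sentence is None or rng.random() >= p_prev:
        return NA_TOKEN
    return prev_sentence


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [choose_prompt("previous sentence", rng) for _ in range(10)]
    print(picks)
```

Mixing both prompt types during training lets the same model decode with or without context at inference time, which is the comparison reported in Table 8.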

Input length	5s	10s	20s
Whisper (encoder-decoder) (Radford et al., 2023)
large-v3	Fjell	Fusilet	Rekordverk
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
medium	thank you	thank you	(Applause)
OWSM-CTC (ours)
medium	.	(	( )

Table 9: ASR outputs with random noise as input.

4.7 Robustness

To investigate robustness, we first consider random noise as input. Table 9 shows the ASR outputs generated by three models. Encoder-decoder models, including Whisper and OWSM v3.1, tend to generate text that looks meaningful, while our OWSM-CTC generates fewer tokens, most of which are punctuation marks that carry no meaning.

Another typical issue of autoregressive decoding is that the generation might fall into repetitions of a few characters or words. Table 19 in Appendix E presents two examples from ASR and ST, respectively. Our non-autoregressive model is more robust in such cases. To quantitatively measure this type of error, we consider a hypothesis as a failure if it contains any character-level $\theta$-gram ($\theta=1,2,\dots,\theta_{\text{max}}$) that consecutively occurs at least $\delta$ times. Table 10 shows the number of failures in all ST test sets with different thresholds. We can see that the encoder-decoder OWSM v3.1 medium fails many times even with beam search, while our OWSM-CTC has almost no failures.
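The failure criterion above can be written as a small function. This is a sketch of one straightforward reading of the definition (a brute-force scan over all start positions), and the function name is hypothetical; the evaluation script used for Table 10 may be implemented differently.

```python
def has_repetition(text: str, theta_max: int, delta: int) -> bool:
    """Return True if some character-level n-gram of length 1..theta_max
    occurs at least `delta` times in a row somewhere in `text`."""
    n = len(text)
    for theta in range(1, theta_max + 1):
        span = theta * delta
        if span > n:
            break  # longer n-grams cannot fit delta times either
        for start in range(n - span + 1):
            unit = text[start:start + theta]
            if text[start:start + span] == unit * delta:
                return True
    return False


if __name__ == "__main__":
    print(has_repetition("thank you " * 5, theta_max=10, delta=5))
    print(has_repetition("hello world", theta_max=10, delta=5))
```

Note that under this definition a run of five identical punctuation marks also counts as a failure when $\delta=5$, so the count is a conservative proxy for degenerate repetition rather than an exact measure of it.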

$\theta_{\text{max}}$	$\delta$	Model	#Failures ($\downarrow$)
10	5	OWSM v3.1	2448
		OWSM v3.1 (beam 5)	630
		OWSM-CTC (ours)	3
20	5	OWSM v3.1	2537
		OWSM v3.1 (beam 5)	672
		OWSM-CTC (ours)	3
20	6	OWSM v3.1	1985
		OWSM v3.1 (beam 5)	453
		OWSM-CTC (ours)	1

Table 10: Comparison of the number of decoding failures in all ST test sets. There are 286k samples in total.

5 Conclusion

We propose OWSM-CTC, a novel encoder-only speech foundation model built upon 180k hours of public audio data and open-source toolkits. OWSM-CTC employs multi-task self-conditioned CTC for multilingual ASR, any-to-any ST, and LID. We conduct extensive experiments to compare OWSM-CTC with the encoder-decoder OWSM models trained on the same data. We find that OWSM-CTC achieves competitive performance on ASR and superior performance on ST for both X-to-En (24% relative improvement) and En-to-X (13% relative improvement), while being more robust and 3 to 4 times faster at inference time. Additionally, OWSM-CTC reduces the long-form ASR WER while running about 20 times faster, thanks to batched parallel decoding. OWSM-CTC also outperforms the baselines on LID. To promote open research on large speech models, we will publicly release our code, pre-trained model weights, and training logs.

Limitations

Although OWSM-CTC reduces the training cost by 22% compared to OWSM v3.1, it still requires nearly 20k GPU hours, which is nontrivial. OWSM-CTC can generate incorrect ASR or ST outputs due to limited training data in certain languages. Care should be taken when using our model for low-resource ASR or ST. Besides, we have only evaluated our model with greedy decoding as it has the fastest inference speed. The non-autoregressive model sometimes makes mistakes in spelling or grammar due to a lack of language models.

Broader Impacts and Ethics

Our OWSM-CTC is a novel encoder-only speech foundation model built upon public datasets and open-source toolkits. Compared to other popular choices, it achieves very strong performance and efficiency. We adhere to the ACL ethics policy and there is no violation of privacy in our experiments. We plan to publicly release all scripts, pre-trained models, and training logs, which can promote transparency and open science. We believe this will benefit the entire speech research community and it can make the latest speech technology available to a broader range of people all over the world.

Acknowledgements

Our computing resources are supported by PSC Bridges2 and NCSA Delta via ACCESS allocation CIS210014, under National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

References

Anil et al. (2023a) Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, and Julian Schrittwieser et al. 2023a. Gemini: A family of highly capable multimodal models. CoRR, abs/2312.11805.
Anil et al. (2023b) Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernández Ábrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan A. Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vladimir Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, et al. 2023b. PaLM 2 technical report. CoRR, abs/2305.10403.
Ardila et al. (2020) Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2020. Common Voice: A massively-multilingual speech corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, pages 4218–4222. European Language Resources Association.
Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Bang et al. (2020) Jeong-Uk Bang, Seung Yun, Seung-Hi Kim, Mu-Yeol Choi, Min-Kyu Lee, Yeo-Jeong Kim, Dong-Hyun Kim, Jun Park, Young-Jik Lee, and Sang-Hun Kim. 2020. KsponSpeech: Korean Spontaneous Speech Corpus for Automatic Speech Recognition. Applied Sciences, 10(19).
Barrault et al. (2023) Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia Gonzalez, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-jussà, Maha Elbayad, Hongyu Gong, Francisco Guzmán, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alexandre Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, and Mary Williamson. 2023. Seamless: Multilingual expressive and streaming speech translation. CoRR, abs/2312.05187.
Beltagy et al. (2020) Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. CoRR, abs/2004.05150.
Bu et al. (2017) Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. 2017. AISHELL-1: An open-source mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pages 1–5.
Burchi and Vielzeuf (2021) Maxime Burchi and Valentin Vielzeuf. 2021. Efficient conformer: Progressive downsampling and grouped attention for automatic speech recognition. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, December 13-17, 2021, pages 8–15. IEEE.
Carletta (2007) Jean Carletta. 2007. Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus. Lang. Resour. Evaluation, 41(2):181–190.
Cattoni et al. (2021) Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. MuST-C: A multilingual corpus for end-to-end speech translation. Comput. Speech Lang., 66:101155.
Chang et al. (2022) Heng-Jui Chang, Shu-Wen Yang, and Hung-yi Lee. 2022. Distilhubert: Speech representation learning by layer-wise distillation of hidden-unit BERT. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022, pages 7087–7091. IEEE.
Chen et al. (2021) Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, and Zhiyong Yan. 2021. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, pages 3670–3674. ISCA.
Chen et al. (2019) Nanxin Chen, Shinji Watanabe, Jesús Villalba, and Najim Dehak. 2019. Listen and fill in the missing letters: Non-autoregressive transformer for speech recognition. CoRR, abs/1911.04908.
Chi et al. (2021) Ethan A. Chi, Julian Salazar, and Katrin Kirchhoff. 2021. Align-refine: Non-autoregressive speech recognition via iterative realignment. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1920–1927, Online. Association for Computational Linguistics.
Chuang et al. (2021) Shun-Po Chuang, Yung-Sung Chuang, Chih-Chiang Chang, and Hung-yi Lee. 2021. Investigating the reordering capability in CTC-based non-autoregressive end-to-end speech translation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1068–1077, Online. Association for Computational Linguistics.
Conneau et al. (2023) Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. 2023. FLEURS: FEW-Shot Learning Evaluation of Universal Representations of Speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805.
Dao et al. (2022) Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with io-awareness. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
Ding et al. (2023) Shaojin Ding, David Qiu, David Rim, Yanzhang He, Oleg Rybakov, Bo Li, Rohit Prabhavalkar, Weiran Wang, Tara N. Sainath, Shivani Agrawal, Zhonglin Han, Jian Li, and Amir Yazdanbakhsh. 2023. USM-Lite: Quantization and sparsity aware fine-tuning for speech recognition with universal speech models. CoRR, abs/2312.08553.
Gandhi et al. (2023) Sanchit Gandhi, Patrick von Platen, and Alexander M. Rush. 2023. Distil-whisper: Robust knowledge distillation via large-scale pseudo labelling. CoRR, abs/2311.00430.
Ghazvininejad et al. (2019) Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6112–6121, Hong Kong, China. Association for Computational Linguistics.
Godfrey et al. (1992) John J. Godfrey, Edward Holliman, and Jane McDaniel. 1992. SWITCHBOARD: telephone speech corpus for research and development. In 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP ’92, San Francisco, California, USA, March 23-26, 1992, pages 517–520. IEEE Computer Society.
Gong et al. (2023) Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James R. Glass. 2023. Listen, think, and understand. CoRR, abs/2305.10790.
Graves (2012) Alex Graves. 2012. Sequence transduction with recurrent neural networks. CoRR, abs/1211.3711.
Graves et al. (2006) Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, volume 148 of ACM International Conference Proceeding Series, pages 369–376. ACM.
Gu et al. (2018) Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In International Conference on Learning Representations.
Gupta et al. (2024) Ankit Gupta, George Saon, and Brian Kingsbury. 2024. Exploring the limits of decoder-only models trained on public speech recognition corpora. CoRR, abs/2402.00235.
Hernandez et al. (2018) François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia A. Tomashenko, and Yannick Estève. 2018. TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer - 20th International Conference, SPECOM 2018, Leipzig, Germany, September 18-22, 2018, Proceedings, volume 11096 of Lecture Notes in Computer Science, pages 198–208. Springer.
Higuchi et al. (2021) Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, and Shinji Watanabe. 2021. A comparative study on non-autoregressive modelings for speech-to-text generation. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, December 13-17, 2021, pages 47–54. IEEE.
Higuchi et al. (2020) Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, and Tetsunori Kobayashi. 2020. Mask CTC: non-autoregressive end-to-end ASR with CTC and mask predict. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pages 3655–3659. ISCA.
Inaguma et al. (2021) Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, and Shinji Watanabe. 2021. ORTHROS: non-autoregressive end-to-end speech translation with dual-decoder. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7503–7507.
Kim et al. (2023) Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, and Shinji Watanabe. 2023. E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 84–91.
Kim et al. (2022) Sehoon Kim, Amir Gholami, Albert E. Shaw, Nicholas Lee, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, and Kurt Keutzer. 2022. Squeezeformer: An efficient transformer for automatic speech recognition. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
Kim et al. (2017) Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4835–4839.
Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Koluguri et al. (2023) Nithin Rao Koluguri, Samuel Kriman, Georgy Zelenfroind, Somshubra Majumdar, Dima Rekesh, Vahid Noroozi, Jagadeesh Balam, and Boris Ginsburg. 2023. Investigating end-to-end ASR architectures for long form audio transcription. CoRR, abs/2309.09950.
Lai et al. (2021) Cheng-I Jeff Lai, Yang Zhang, Alexander H. Liu, Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David D. Cox, and Jim Glass. 2021. PARP: prune, adjust and re-prune for self-supervised speech recognition. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 21256–21272.
Lee and Watanabe (2021) Jaesong Lee and Shinji Watanabe. 2021. Intermediate loss regularization for CTC-based speech recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6224–6228.
Lee et al. (2022) Yeonghyeon Lee, Kangwook Jang, Jahyun Goo, Youngmoon Jung, and Hoi Rin Kim. 2022. FitHuBERT: Going thinner and deeper for knowledge distillation of speech self-supervised models. In Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022, pages 3588–3592. ISCA.
Nawrot et al. (2023) Piotr Nawrot, Jan Chorowski, Adrian Lancucki, and Edoardo Maria Ponti. 2023. Efficient transformers with dynamic token pooling. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6403–6417, Toronto, Canada. Association for Computational Linguistics.
Ng et al. (2021) Edwin G. Ng, Chung-Cheng Chiu, Yu Zhang, and William Chan. 2021. Pushing the limits of non-autoregressive speech recognition. In Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, pages 3725–3729. ISCA.
Nozaki and Komatsu (2021) Jumon Nozaki and Tatsuya Komatsu. 2021. Relaxing the conditional independence assumption of CTC-based ASR by conditioning on intermediate predictions. In Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, pages 3735–3739. ISCA.
O’Neill et al. (2021) Patrick K. O’Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, and Georg Kucsko. 2021. SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition. In Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, pages 1434–1438. ISCA.
OpenAI (2023) OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
Panayotov et al. (2015) Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pages 5206–5210. IEEE.
Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 8024–8035.
Peng et al. (2022) Yifan Peng, Siddharth Dalmia, Ian R. Lane, and Shinji Watanabe. 2022. Branchformer: Parallel mlp-attention architectures to capture local and global context for speech recognition and understanding. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 17627–17643. PMLR.
Peng et al. (2023a) Yifan Peng, Kwangyoun Kim, Felix Wu, Prashant Sridhar, and Shinji Watanabe. 2023a. Structured pruning of self-supervised pre-trained models for speech recognition and understanding. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5.
Peng et al. (2023b) Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, and Shinji Watanabe. 2023b. A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks. In Proc. INTERSPEECH 2023, pages 2208–2212.
Peng et al. (2023c) Yifan Peng, Jaesong Lee, and Shinji Watanabe. 2023c. I3D: transformer architectures with input-dependent dynamic depth for speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, pages 1–5. IEEE.
Peng et al. (2023d) Yifan Peng, Yui Sudo, Shakeel Muhammad, and Shinji Watanabe. 2023d. DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models. In Proc. INTERSPEECH 2023, pages 62–66.
Peng et al. (2024) Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, and Shinji Watanabe. 2024. OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer. CoRR, abs/2401.16658.
Peng et al. (2023e) Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-Weon Jung, Soumi Maiti, and Shinji Watanabe. 2023e. Reproducing whisper-style training using an open-source toolkit and publicly available data. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8.
Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.
Post et al. (2013) Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch, and Sanjeev Khudanpur. 2013. Improved speech-to-text translation with the fisher and callhome Spanish-English speech translation corpus. In Proceedings of the 10th International Workshop on Spoken Language Translation: Papers, Heidelberg, Germany.
Pratap et al. (2023) Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. 2023. Scaling speech technology to 1,000+ languages. CoRR, abs/2305.13516.
Pratap et al. (2020) Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. 2020. MLS: A large-scale multilingual dataset for speech research. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pages 2757–2761. ISCA.
Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 28492–28518. PMLR.
Rekesh et al. (2023) Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, and Boris Ginsburg. 2023. Fast conformer with linearly scalable attention for efficient speech recognition. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8.
Rubenstein et al. (2023) Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara N. Sainath, Johan Schalkwyk, Matthew Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirovic, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, and Christian Havnø Frank. 2023. AudioPaLM: A large language model that can speak and listen. CoRR, abs/2306.12925.
Strimel et al. (2023) Grant Strimel, Yi Xie, Brian John King, Martin Radfar, Ariya Rastrow, and Athanasios Mouchtaris. 2023. Lookahead when it matters: Adaptive non-causal transformers for streaming neural transducers. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32654–32676. PMLR.
Tay et al. (2023) Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2023. Efficient transformers: A survey. ACM Comput. Surv., 55(6):109:1–109:28.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
Wang et al. (2021) Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. 2021. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 993–1003, Online. Association for Computational Linguistics.
Wang et al. (2020a) Changhan Wang, Anne Wu, and Juan Miguel Pino. 2020a. CoVoST 2: A massively multilingual speech-to-text translation corpus. CoRR, abs/2007.10310.
Wang et al. (2023a) Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul K. Rubenstein, Lukas Zilka, Dian Yu, Golan Pundak, Nikhil Siddhartha, Johan Schalkwyk, and Yonghui Wu. 2023a. SLM: Bridge the thin gap between speech and text foundation models. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8.
Wang et al. (2020b) Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020b. Linformer: Self-attention with linear complexity. CoRR, abs/2006.04768.
Wang et al. (2023b) Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, and Furu Wei. 2023b. VioLA: Unified codec language models for speech recognition, synthesis, and translation. CoRR, abs/2305.16107.
Watanabe et al. (2018) Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. ESPnet: End-to-end speech processing toolkit. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, pages 2207–2211. ISCA.
Xiao et al. (2023) Yisheng Xiao, Lijun Wu, Junliang Guo, Juntao Li, Min Zhang, Tao Qin, and Tie-Yan Liu. 2023. A survey on non-autoregressive generation for neural machine translation and beyond. IEEE Trans. Pattern Anal. Mach. Intell., 45(10):11407–11427.
Xu et al. (2023) Chen Xu, Xiaoqian Liu, Xiaowen Liu, Qingxuan Sun, Yuhao Zhang, Murun Yang, Qianqian Dong, Tom Ko, Mingxuan Wang, Tong Xiao, Anxiang Ma, and Jingbo Zhu. 2023. CTC-based non-autoregressive speech translation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13321–13339, Toronto, Canada. Association for Computational Linguistics.
Yan et al. (2023) Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, and Shinji Watanabe. 2023. CTC alignments improve autoregressive translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1623–1639, Dubrovnik, Croatia. Association for Computational Linguistics.
Ye et al. (2022) Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, and Jun Cao. 2022. GigaST: A 10,000-hour pseudo speech translation corpus. CoRR, abs/2204.03939.
Yeh et al. (2023) Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, and Abdelrahman Mohamed. 2023. Efficient speech representation learning with low-bit quantization. CoRR, abs/2301.00652.
Yin et al. (2023) Yue Yin, Daijiro Mori, and Seiji Fujimoto. 2023. ReazonSpeech: A free and massive corpus for Japanese ASR. In Annual meetings of the Association for Natural Language Processing.
Yoon et al. (2022) Ji Won Yoon, Beom Jun Woo, and Nam Soo Kim. 2022. HuBERT-EE: Early exiting hubert for efficient speech recognition. CoRR, abs/2204.06328.
Zhang et al. (2022) Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, and Zhendong Peng. 2022. WENETSPEECH: A 10000+ hours multi-domain mandarin corpus for speech recognition. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6182–6186.
Zhang et al. (2023) Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara N. Sainath, Pedro J. Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, and Yonghui Wu. 2023. Google USM: scaling automatic speech recognition beyond 100 languages. CoRR, abs/2303.01037.

Appendix A Details of Experimental Setups

Model	Unlabeled	English ASR	Other ASR	ST	Languages	Vocabulary Size
Whisper (Radford et al., 2023)
Initial versions	-	438k hours	117k hours	125k hours	99	52k
large-v3	4M hours	1M hours of labeled in total			100	52k
OWSM v3.1 (Peng et al., 2024)
	-	73k hours	67k hours	40k hours	151	50k
OWSM-CTC (ours)
	-	73k hours	67k hours	40k hours	151	50k

Table 11: Details of training data. Our data is prepared using the scripts released by OWSM v3.1 (Peng et al., 2024).

Model	Params	Encoder	Decoder	Layers	Hidden Size	Attention Heads	Time Shift
Whisper (Radford et al., 2023)
tiny	39M	Transformer	Transformer	4	384	6	20ms
base	74M	Transformer	Transformer	6	512	8	20ms
small	244M	Transformer	Transformer	12	768	12	20ms
medium	769M	Transformer	Transformer	24	1024	16	20ms
large	1.55B	Transformer	Transformer	32	1280	20	20ms
large-v3	1.55B	Transformer	Transformer	32	1280	20	20ms
OWSM v3.1 (Peng et al., 2024)
base	101M	E-Branchformer	Transformer	6	384	6	40ms
medium	1.02B	E-Branchformer	Transformer	18	1024	16	40ms
OWSM-CTC (ours)
medium	1.01B	E-Branchformer	-	27	1024	16	80ms

Table 12: Details of model architectures. Whisper (Radford et al., 2023) and OWSM v3.1 (Peng et al., 2024) are encoder-decoder models, whereas our OWSM-CTC is an encoder-only model. We mostly follow the design of OWSM v3.1 medium, but we increase the number of encoder layers to match the overall model size.

OWSM v3.1 (Peng et al., 2024)
Model	Batch Size	Total Steps	Warmup Steps	Max Learning Rate	InterCTC Layers $\mathcal{S}$
base	256	675k	60k	1e-3	-
medium	256	675k	60k	2e-4	-
OWSM-CTC (ours)
medium	256	600k	60k	2e-4	6, 12, 15, 21

Table 13: Training hyperparameters. We mostly follow the training setups of OWSM v3.1 medium (Peng et al., 2024). As described in Section 3.2, we employ self-conditioned CTC at four intermediate layers.

Downsampling Strategy	Params	GPU VRAM ( $\downarrow$ )	Speed-up ( $\uparrow$ )	ASR WER ( $\downarrow$ )	ST BLEU ( $\uparrow$ )
4x in CNN	55M	38GB	1.00x	8.3	22.0
6x in CNN	55M	22GB	1.12x	8.6	21.3
8x in CNN	55M	19GB	1.13x	8.8	21.5
4x in CNN + 2x in the middle of Encoder	55M	38GB	1.03x	9.7	21.6

Table 14: Comparison of different downsampling strategies on MuST-C v2 En-De. The other configurations, such as batch size, are kept the same. Using 4x downsampling achieves the best ASR and ST results, while using 8x downsampling significantly reduces the GPU memory usage, which enables a larger batch size per GPU. We employ 8x downsampling in our large-scale OWSM-CTC to reduce training costs.

ASR-Only CTC Layers	Task-Dependent CTC Layers	ASR WER ( $\downarrow$ )	ST BLEU ( $\uparrow$ )
-	6, 12, 18, 24	diverged
6	12, 18, 24	9.0	21.6
6, 12	18, 24	8.8	21.5
6, 12, 18	24	8.4	21.2

Table 15: Effect of the CTC type. This small-scale model has 24 layers with 8x downsampling in CNN. As described in Section 3.2, we employ self-conditioned CTC at some intermediate layers. These CTC layers can perform a single task like ASR or multiple tasks depending on the task specifier. If we allow all CTC layers to perform multiple tasks (ASR and ST), the model cannot converge from scratch. Therefore, we leverage the first few CTC layers for ASR only and the remaining ones for multi-tasking.

Src Lang.	de	es	fr	ca	Average ( $\uparrow$ )	Speed-up ( $\uparrow$ )
data size	4.3	6.7	4.5	0.2
Whisper (encoder-decoder) (Radford et al., 2023)
base	11.4	19.2	13.1	9.7	13.4	1.84x
small	25.0	32.8	26.4	21.7	26.5	1.54x
medium	33.6	39.7	34.4	29.2	34.2	0.84x
data size	0.2	0.1	0.3	0.1
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base	7.3	10.0	11.1	9.0	9.4	2.78x
medium	17.1	22.3	22.7	18.4	20.1	1.00x
OWSM-CTC (ours)
medium	21.1	28.2	27.7	23.7	25.2	3.35x

Table 16: BLEU ( $\uparrow$ ) of X-to-En ST on CoVoST-2 using lowercase without punctuation. Data sizes are in thousand hours. Bold: the best result. Underlined: our OWSM-CTC outperforms OWSM v3.1 medium.

Tgt Lang.	de	ca	zh	fa	et	mn	tr	ar	sv	lv	sl	ta	ja	id	cy	Average ( $\uparrow$ )	Speed-up ( $\uparrow$ )
data size	14.0	0.4	13.7	0.8	0.4	0.4	0.9	0.9	0.4	0.4	0.4	0.4	1.0	0.4	0.4
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base	14.6	7.7	14.5	3.0	1.8	1.0	1.2	1.6	8.1	1.3	0.7	0.0	8.7	5.1	4.5	4.9	2.39x
medium	25.4	19.6	32.1	10.1	7.7	4.6	6.5	7.2	20.3	6.4	9.0	0.0	19.6	16.1	15.3	13.3	1.00x
OWSM-CTC (ours)
medium	25.5	23.0	35.1	10.0	9.2	4.8	6.8	8.2	23.8	7.7	12.0	0.0	18.5	21.0	19.4	15.0	4.20x

Table 17: BLEU ( $\uparrow$ ) of En-to-X ST on CoVoST-2 using lowercase without punctuation. Data sizes are in thousand hours. Bold: the best result. Underlined: our OWSM-CTC outperforms OWSM v3.1 medium. Note that Whisper does not support En-to-X translation.

Input audio content	Previous sentence	ASR w/o previous	ASR w/ previous
future ’s over here wind sun a new energy grid new investments to create high paying jobs repower america it ’s time to get real there is an old african proverb that says if you want to go quickly go alone if you want to go far go together we need to go far quickly thank you very much	with one hundred percent clean electricity within ten years a plan to put america back to work make us more secure and help stop global warming finally a solution that ’s big enough to solve our problems repower america find out more this is the last one it ’s about repowering america one of the fastest ways to cut our dependence on old dirty fuels that are killing our planet	Future’s over here. Wind, sun. A new energy grid. New investments to create high-pan jobs. Repower America. It’s time to get real. There’s an old African proverb that says, "If you want to go quickly, go alone. if you want to go far, go together." We need to go far quickly. Thank you very much. (Applause)	future ’s over here wind sun a new energy grid new investments to create high pan jobsrepower america it ’s time to get real there ’s an old african proverb that says if you want to go quickly go alone if you want to go far go together we need to go far quickly thank you very much

Table 18: Using a previous sentence as the prompt might change the output style. The optional prompt encoder is defined in Figure 2 and Section 3.3.

Groundtruth reference	OWSM v3.1 output	OWSM-CTC output (ours)
in search of the mythical treasure your grandfather is supposed to have secreted there he laughed and the girl instinctively shuddered with a newborn distrust there was no mirth in the sound	in search of the mythical treasure your grandfather is supposed to have secreted there ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha …	in search of the mythical treasure your grandfather is supposed to have secreted there he laughed and the girl instinctively shuddered with a new-born distrust there was no mirth in the sound
and with her they began a national tour that took them all around the country	they take a national gira which leads to rerererererererererererererere …	with learn a national tour that leads them to run the entire country

Table 19: Autoregressive decoding sometimes gets trapped in a loop in both ASR (row 1, MLS En) and ST (row 2, CoVoST-2 Es-En). Our OWSM-CTC is more robust.

A.1 Training data

Table 11 summarizes the training data statistics. We prepare the training data mixture using the scripts publicly released by OWSM v3.1 (Peng et al., 2024). This ensures a fair comparison between our OWSM-CTC and the previously released encoder-decoder OWSM models.

Our use of the data is consistent with their intended use. These datasets have been widely used in speech research. They do not violate the privacy of creators or users, nor do they contain any offensive content. Specifically, the individual training datasets and licenses are as follows: AIDATATANG (CC BY-NC-ND 4.0; https://www.openslr.org/62/), AISHELL-1 (Apache 2.0) Bu et al. (2017), AMI (CC BY 4.0) Carletta (2007), Babel (https://www.iarpa.gov/research-programs/babel), CommonVoice (CC0-1.0) Ardila et al. (2020), CoVoST2 (CC BY-NC 4.0) Wang et al. (2020a), Fisher Switchboard (LDC) Godfrey et al. (1992), Fisher Callhome Spanish (LDC) Post et al. (2013), FLEURS (CC-BY-4.0) Conneau et al. (2023), Googlei18n (resources 32, 35, 36, 37, 41, 42, 43, 44, 52, 53, 54, 61, 63, 64, 65, 66, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, and 86 from openslr.org), GigaSpeech (Apache 2.0) Chen et al. (2021), GigaST (CC BY-NC 4.0) Ye et al. (2022), KsponSpeech (MIT License) Bang et al. (2020), LibriSpeech (CC BY 4.0) Panayotov et al. (2015), Multilingual LibriSpeech (CC BY 4.0) Pratap et al. (2020), MagicData (CC BY-NC-ND 4.0; https://openslr.org/68/), MuST-C (CC BY NC ND 4.0 International) Cattoni et al. (2021), SPGISpeech O’Neill et al. (2021), TEDLIUM3 (CC BY-NC-ND 3.0) Hernandez et al. (2018), ReazonSpeech (Apache 2.0 / CDLA-Sharing-1.0) Yin et al. (2023), Russian OpenSTT (CC-BY-NC; https://github.com/snakers4/open_stt), VCTK (CC BY 4.0; https://huggingface.co/datasets/vctk), VoxForge (GPL; https://www.voxforge.org/), VoxPopuli (Attribution-NonCommercial 4.0 International) Wang et al. (2021), WenetSpeech (Creative Commons Attribution 4.0 International License) Zhang et al. (2022).

A.2 Model architectures

Table 12 shows the model configurations. Our OWSM-CTC mostly follows the design of OWSM v3.1 medium (Peng et al., 2024), but we only use an encoder. To match the total model size, we increase the number of layers to 27, leading to a total of 1B parameters. Note that the sequence length of the encoder is usually longer than that of the decoder. Hence, the encoder-only model can have a higher computational cost than the encoder-decoder model. To alleviate this issue, we apply a larger downsampling rate in the CNN module to reduce the sequence length. Our final time shift is 80ms, as opposed to 40ms of the encoder-decoder OWSM models. We observe that our training time for a fixed number of updates is roughly the same as that of OWSM v3.1 medium. We also investigated different downsampling strategies at a smaller scale, as discussed in Appendix B.1 and Table 14.
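To make the sequence-length argument concrete, the following sketch counts encoder frames for the two time shifts in Table 12. The 30-second utterance length and the helper name `num_frames` are assumptions chosen for illustration, and the quadratic estimate covers only the self-attention term, not the whole model.

```python
def num_frames(duration_s: float, time_shift_ms: int) -> int:
    """Approximate number of encoder frames, ignoring padding and edge effects."""
    return int(duration_s * 1000) // time_shift_ms


duration_s = 30.0  # hypothetical utterance length, for illustration only
frames_40ms = num_frames(duration_s, 40)  # time shift of encoder-decoder OWSM v3.1
frames_80ms = num_frames(duration_s, 80)  # time shift of OWSM-CTC

# Self-attention cost grows roughly quadratically with sequence length,
# so halving the number of frames reduces that term by about 4x per layer.
attention_ratio = (frames_40ms / frames_80ms) ** 2
print(frames_40ms, frames_80ms, attention_ratio)
```

Under these assumptions, the deeper 27-layer encoder operates on half as many frames per utterance, which is in line with the observation that training time per update stays roughly unchanged.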

A.3 Training hyperparameters

Table 13 presents the training hyperparameters of OWSM v3.1 and our OWSM-CTC. Again, we follow the previous OWSM v3.1 (Peng et al., 2024) for a fair comparison, except that we adopt self-conditioned CTC (Nozaki and Komatsu, 2021) at four intermediate layers (see Section 3.2).

Appendix B Small-Scale Ablation Studies

Before the large-scale training using the entire 180k hours of audio data, we conducted preliminary experiments on MuST-C v2 En-De (Cattoni et al., 2021) to investigate the effect of the CNN downsampling rate and the choice of the task for intermediate CTC layers. Specifically, we train 24-layer E-Branchformer-CTC models on the combined ASR and ST data from MuST-C v2 En-De. The input is always English audio, but the output can be the English ASR transcript or its German translation depending on the task specifier (see Figure 2).

B.1 Effect of downsampling strategies

Table 14 compares different downsampling strategies while the other configurations are kept the same. The attention is implemented with FlashAttention (Dao et al., 2022). Self-conditioned CTC is applied at three intermediate layers: 6, 12, and 18. The first two CTC layers always perform ASR, while the others are task-dependent. Compared with 4x downsampling, 8x downsampling in the CNN module slightly degrades WER (8.3 to 8.8) and BLEU (22.0 to 21.5) but halves the GPU memory usage (38GB to 19GB). We therefore employ 8x downsampling in our large-scale OWSM-CTC, which allows the batch size per GPU to be doubled. As mentioned in Appendix A.2, with this strategy, we observe a training speed similar to that of the encoder-decoder OWSM model.
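The trade-off can be read directly off Table 14. The short sketch below recomputes it from the table values; the dictionary layout and the helper name `compare` are illustrative assumptions, not part of any released code.

```python
# Values copied from Table 14 (MuST-C v2 En-De, small-scale models).
table14 = {
    "4x in CNN": {"vram_gb": 38, "wer": 8.3, "bleu": 22.0},
    "6x in CNN": {"vram_gb": 22, "wer": 8.6, "bleu": 21.3},
    "8x in CNN": {"vram_gb": 19, "wer": 8.8, "bleu": 21.5},
}


def compare(baseline: str, candidate: str) -> dict:
    """Return the memory ratio and metric deltas of candidate relative to baseline."""
    b, c = table14[baseline], table14[candidate]
    return {
        "vram_ratio": c["vram_gb"] / b["vram_gb"],
        "wer_delta": round(c["wer"] - b["wer"], 1),    # positive means worse WER
        "bleu_delta": round(c["bleu"] - b["bleu"], 1),  # negative means worse BLEU
    }


result = compare("4x in CNN", "8x in CNN")
print(result)
```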

B.2 Choice of the CTC task

As discussed in Section 3.2, the intermediate CTC layers can be configured to perform a specific task like ASR or multiple tasks depending on the input task token. Table 15 compares different choices at a small scale using MuST-C v2 En-De. If all CTC layers are task-dependent (i.e., multi-tasking), the model cannot converge when trained from scratch. As more layers are used for ASR only, the ASR WER improves, but the ST BLEU decreases slightly. A good trade-off is to use the first half for ASR only and the second half for multi-tasking. Therefore, in our large-scale OWSM-CTC with 27 layers, we configure the 6th, 12th, and 15th layers to perform ASR only and the other two CTC layers (i.e., 21st and 27th layers) to be multi-tasking. This design also mimics the conventional cascaded system for ST.
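The layer assignment described above can be written down as a small lookup. This is only a sketch of the configuration, not the released implementation; the task labels `"asr"` and `"st"` and the function name are hypothetical.

```python
from typing import Optional

ASR_ONLY_LAYERS = (6, 12, 15)     # intermediate CTC layers that always predict the ASR transcript
TASK_DEPENDENT_LAYERS = (21, 27)  # CTC layers that follow the task specifier; 27 is the final layer


def ctc_task_for_layer(layer: int, requested_task: str) -> Optional[str]:
    """Return the task a CTC layer is trained on, or None if the layer has no CTC head."""
    if layer in ASR_ONLY_LAYERS:
        return "asr"
    if layer in TASK_DEPENDENT_LAYERS:
        return requested_task
    return None


# For a translation request, the lower layers still target the transcript,
# which loosely mirrors a cascaded ASR-then-ST pipeline.
for layer in (6, 12, 15, 21, 27):
    print(layer, ctc_task_for_layer(layer, "st"))
```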

Appendix C More Results of ST

Section 4.4 reports BLEU scores in true case with punctuation. In this section, Table 16 and Table 17 present BLEU in lowercase without punctuation, which is consistent with the setup in prior work (Peng et al., 2024). The findings are consistent with those in Section 4.4. On average, our OWSM-CTC achieves higher BLEU than the encoder-decoder OWSM v3.1 medium (25.2 vs. 20.1 for X-to-En and 15.0 vs. 13.3 for En-to-X) with faster inference (3.35x and 4.20x speed-up, respectively).
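The exact normalization script is not described in this appendix. As a rough illustration of what "lowercase without punctuation" means before scoring, a minimal sketch might look as follows. It handles ASCII punctuation only, which is an assumption; target languages such as Chinese or Japanese would need additional punctuation handling.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse repeated whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


print(normalize("We need to go far, quickly. Thank you very much!"))
```

Both the hypothesis and the reference would be passed through the same function before computing BLEU, so that differences in casing and punctuation style do not affect the score.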

Appendix D Effect of text prompt

Table 18 presents an example from TEDLIUM, where the text prompt changes the output style. When there is no prompt, the ASR output of OWSM-CTC is in true case with punctuation, and the apostrophes are combined with the previous words. However, when the previous sentence is used as a prompt, the style of the ASR hypothesis becomes more similar to that of the prompt. Specifically, the text is now in lowercase without punctuation marks, and the apostrophes are separate from previous words. This style is closer to the groundtruth transcript.

Although the above example looks promising for biasing the model’s output toward certain directions, we note that this is not guaranteed to work in a zero-shot manner. We have also tried a few examples of zero-shot contextual biasing, where we provide a few biasing words in the prompt (e.g., person names), but we find that the model fails to generate the correct word in many cases. This is mainly because the model is not explicitly trained for this task: during training, we only provide the previous sentence as the prompt (with some probability), which might not be useful at all, so the non-autoregressive model can simply ignore it in most cases. A more practical way to utilize this feature is to fine-tune our pre-trained model on carefully designed data for contextual biasing. We will explore this in the future.

Appendix E Robustness

Table 19 shows that autoregressive decoding sometimes fails to generate the correct output for either ASR or ST, while non-autoregressive decoding is generally more robust to this type of error.
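As general background on CTC rather than a finding of the experiments above, one reason for this robustness is that greedy CTC decoding emits at most one token per encoder frame, so the output length is bounded by the input length and decoding cannot loop indefinitely. A minimal sketch with toy label indices (the blank index and the frame labels are assumptions for illustration):

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # hypothetical index of the CTC blank token


def ctc_greedy_collapse(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Merge consecutive repeated labels, then drop blanks."""
    return [label for label, _ in groupby(frame_ids) if label != blank]


frames = [0, 7, 7, 0, 0, 3, 3, 3, 0, 7, 0]  # toy per-frame argmax labels
tokens = ctc_greedy_collapse(frames)
print(tokens, len(tokens) <= len(frames))
```

An autoregressive decoder has no such bound other than a maximum-length cutoff, which is how repeated outputs like those in Table 19 can arise.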