OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Yifan Peng
Carnegie Mellon University

Yui Sudo
Honda Research Institute Japan

Muhammad Shakeel
Honda Research Institute Japan

Shinji Watanabe
Carnegie Mellon University

Abstract

There has been an increasing interest in large speech models that can perform multiple tasks in a single model. Such models usually adopt an encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Though prior studies observed promising results of non-autoregressive models for certain tasks at small scales, it remains unclear if they can be scaled to speech-to-text generation in diverse languages and tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID). Compared to encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 24% relative improvement on ST, while it is more robust and 3 to 4 times faster for inference. OWSM-CTC also improves the long-form ASR result with 20x speed-up. We will publicly release our code, pre-trained model, and training logs to promote open science in speech foundation models (https://github.com/espnet/espnet).


1 Introduction

Figure 1: Performance vs. speed for encoder-decoder OWSM v3.1 and our encoder-only OWSM-CTC. Panels: (a) English speech recognition (WER vs. speed-up); (b) X-to-En speech translation (BLEU vs. speed-up); (c) En-to-X speech translation (BLEU vs. speed-up). The plots compare the OWSM v3.1 series (base and medium) with OWSM-CTC (ours).

The great success of large language models (LLMs) (OpenAI, 2023; Touvron et al., 2023; Anil et al., 2023b) has sparked a growing interest in developing foundation models in various modalities. Recent studies have explored different approaches towards multilingual and multi-tasking speech foundation models (Radford et al., 2023; Zhang et al., 2023; Pratap et al., 2023; Rubenstein et al., 2023; Barrault et al., 2023; Peng et al., 2023e). OpenAI Whisper (Radford et al., 2023) is a series of Transformer encoder-decoder models trained on 680k hours of proprietary labeled audio. Whisper achieves strong results in multilingual automatic speech recognition (ASR), any-to-English speech translation (ST), and spoken language identification (LID). Although it shows the effectiveness of large-scale (weakly) supervised pre-training, the full development pipeline, including training data details, is not publicly accessible. Recent works have developed Open Whisper-style Speech Models (OWSM) (Peng et al., 2023e, 2024) with the aim of reproducing Whisper-style training using public data and open-source toolkits. However, Whisper and OWSM adopt the encoder-decoder architecture, which generates text tokens given speech in an autoregressive manner. Such models might hallucinate during inference, and decoding can be slow. Other models with decoder-only architectures, like AudioPaLM (Rubenstein et al., 2023) and VioLA (Wang et al., 2023b), could suffer from the same issues due to autoregressive decoding.

Another line of work, such as Google USM (Zhang et al., 2023) and Meta MMS (Pratap et al., 2023), uses non-autoregressive models with Connectionist Temporal Classification (CTC) (Graves et al., 2006), but these CTC-based models are designed for ASR only. Prior studies have also achieved promising results with CTC models for ST only, but they mainly focus on specific language pairs at much smaller scales (Inaguma et al., 2021; Chuang et al., 2021; Xu et al., 2023). Some of them employ additional decoders (Inaguma et al., 2021; Yan et al., 2023) or cross-attention layers (Xu et al., 2023), making the model more complicated.

A natural question now arises: Can we build a non-autoregressive encoder-only model for speech-to-text generation in diverse languages and multiple tasks like Whisper/OWSM? This research problem has become increasingly important in the era of LLMs because large-scale pre-trained speech encoders can serve as an adapter between the speech and text modalities (Gong et al., 2023; Wang et al., 2023a), providing a promising avenue towards general-purpose multi-modal foundation models (Anil et al., 2023a).

In this work, we propose OWSM-CTC, a novel encoder-only speech foundation model based on multi-task self-conditioned CTC to imitate OWSM’s multilingual ASR, any-to-any ST, and LID functionalities. Following previous encoder-decoder OWSM v3.1 models (Peng et al., 2024), we train a 1B OWSM-CTC model using 180k hours of public data covering 151 languages. Extensive evaluations show that our OWSM-CTC exhibits strong performance and efficiency. Compared to the 1B OWSM v3.1 medium model, OWSM-CTC achieves comparable performance for ASR and superior performance for various ST directions (up to 24% relative improvement) while being more robust and showing 3 to 4 times inference speed-up. OWSM-CTC also improves the WER for long-form ASR and can be 20 times faster due to batched parallel decoding. OWSM-CTC further outperforms the other baseline models on LID. Our code, pre-trained model weights, and training logs will be publicly released to facilitate the development of large speech models.

2 Related Work

2.1 Speech foundation models

Attention-based encoder-decoder. OpenAI Whisper (Radford et al., 2023) adopts the standard Transformer encoder-decoder architecture (Vaswani et al., 2017) and scales the training data to 680k hours of proprietary labeled audio (their latest large-v3 version uses 1M hours of labeled audio and 4M hours of pseudo-labeled audio). However, the complete pipeline for model development, including training data details and training code, is not publicly available. A recent project, OWSM, aims to reproduce Whisper-style training using public data and open-source toolkits to promote transparency and open science in this field (Peng et al., 2023e). The latest OWSM v3.1 models (Peng et al., 2024) employ E-Branchformer (Kim et al., 2023) as the encoder and Transformer as the decoder, which are trained with a joint ASR CTC loss (Kim et al., 2017). Although OWSM has promising results using public corpora, it still follows the encoder-decoder architecture, which can be slow and unstable at inference time.

Decoder-only. Several studies employ decoder-only models for speech-to-text tasks. AudioPaLM (Rubenstein et al., 2023) extends the textual PaLM-2 (Anil et al., 2023b) to support speech understanding and generation tasks including ASR and ST. DOTA (Gupta et al., 2024) is a decoder-only Transformer model trained on 93k hours of public English ASR data, but it does not support other languages or ST. Decoder-only models face the same slowness and robustness issues as encoder-decoder due to autoregressive decoding.

CTC or Transducer. Another line of research proposes to utilize CTC (Graves et al., 2006) or Transducer (Graves, 2012) for ASR. Google USM (Zhang et al., 2023) provides generic ASR models that are first pre-trained on 12M hours of unlabeled audio and then fine-tuned on proprietary labeled data with CTC or Transducer. Meta MMS (Pratap et al., 2023) pre-trains a wav2vec 2.0 model (Baevski et al., 2020) on massively multilingual data and then fine-tunes it with CTC on labeled ASR data covering over 1k languages. These models employ CTC only for ASR. In our OWSM-CTC, we propose a single CTC-based encoder-only model for ASR, ST, and LID. Our supported tasks are more similar to Whisper-style models.

2.2 Efficient speech models

Model compression. Various algorithms have been utilized to compress speech models, including knowledge distillation (Chang et al., 2022; Lee et al., 2022; Peng et al., 2023d; Gandhi et al., 2023), pruning (Lai et al., 2021; Peng et al., 2023a), quantization (Yeh et al., 2023; Ding et al., 2023), and dynamic module execution (Yoon et al., 2022; Peng et al., 2023c; Strimel et al., 2023). These methods are typically applied to pre-trained models and are thus orthogonal to this work. In the future, we will apply compression to further improve efficiency.

Efficient architectures. Better network architectures can also improve efficiency, including attention with linear complexity (Beltagy et al., 2020; Wang et al., 2020b; Tay et al., 2023) and sequence length reduction (Burchi and Vielzeuf, 2021; Kim et al., 2022; Nawrot et al., 2023; Rekesh et al., 2023). In this work, we do not modify the attention but use larger downsampling in the convolution module to reduce the sequence length. More details are in Appendix A.2 and B.1.

Figure 2: Architecture of our OWSM-CTC. For an input audio, it predicts a language token along with ASR or ST text tokens depending on the task specifier. An optional text prompt can be provided, which mimics Whisper.

2.3 CTC-based speech models

Non-autoregressive models have a faster inference speed than their autoregressive counterparts due to parallel decoding. They have been utilized in machine translation (Gu et al., 2018; Ghazvininejad et al., 2019; Xiao et al., 2023), ASR (Chen et al., 2019; Higuchi et al., 2020; Ng et al., 2021; Chi et al., 2021; Lee and Watanabe, 2021; Nozaki and Komatsu, 2021), and ST (Inaguma et al., 2021; Chuang et al., 2021; Xu et al., 2023).

CTC was originally proposed to label sequences without explicit segmentation (Graves et al., 2006). CTC-based ASR models learn a monotonic alignment between speech features and text tokens. With parallel greedy decoding, they are much faster than autoregressive models. However, the accuracy of CTC is generally inferior due to the conditional independence assumption between output tokens. To address this issue, Intermediate CTC (InterCTC) (Lee and Watanabe, 2021) calculates additional CTC losses using intermediate representations from the encoder. Self-conditioned CTC (Nozaki and Komatsu, 2021) further extends InterCTC by adding the predictions of intermediate CTC layers back to the input of the subsequent encoder layers. These approaches have been shown to be highly effective in speech-to-text generation tasks without a decoder (Higuchi et al., 2021).
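The parallel greedy decoding mentioned above can be illustrated with a minimal sketch. It assumes the blank symbol has index 0 and that per-frame scores are already available as nested lists; neither detail comes from the paper, and the function name is hypothetical.

```python
from typing import List, Optional, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], blank: int = BLANK) -> List[int]:
    """Pick the best symbol per frame, merge repeats, then drop blanks."""
    # Each frame is decided independently of the others, which is why
    # this step can run in parallel over time.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    tokens: List[int] = []
    previous: Optional[int] = None
    for token in best_path:
        if token != previous and token != blank:
            tokens.append(token)
        previous = token
    return tokens


# Toy example with 3 symbols (0 = blank) and 6 frames.
scores = [
    [0.1, 0.8, 0.1],    # symbol 1
    [0.2, 0.7, 0.1],    # symbol 1 again, merged with the previous frame
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # symbol 1, a new token because a blank intervened
    [0.1, 0.1, 0.8],    # symbol 2
    [0.6, 0.2, 0.2],    # blank
]
print(ctc_greedy_decode(scores))  # [1, 1, 2]
```

Because no frame waits for a previously emitted token, the cost is a single encoder pass plus this cheap collapse step, in contrast to one decoder step per output token in autoregressive models.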

Although CTC assumes a monotonic alignment between input and output, it can be used for ST with the reordering capability of self-attention (Inaguma et al., 2021; Chuang et al., 2021).

Conventional CTC models are typically designed for a specific task or language. It remains under-explored whether such approaches can be scaled to multilingual and multi-task scenarios. This work proposes a novel encoder-only speech foundation model based on multi-task self-conditioned CTC. This single model performs well in multilingual ASR, ST, and LID.

3 OWSM-CTC

3.1 Overall architecture

Figure 2 shows the architecture of OWSM-CTC. Its main component is a speech encoder, which takes speech features as input and predicts the spoken language as well as the ASR or ST hypothesis using CTC. To mimic Whisper-style models that condition text generation on an optional text prompt (Radford et al., 2023; Peng et al., 2023e, 2024), we employ a separate Transformer encoder to process the prompt and inject the output to the main model through cross-attention. Then, the model can potentially attend to the text prompt when generating text.

3.2 Speech encoder

For an input waveform, we first extract log Mel filterbanks and then apply a 2D convolution module to downsample the feature sequence along the time dimension. Let $\mathbf{X}_{\text{speech}} \in \mathbb{R}^{T \times d}$ be the downsampled feature sequence of length $T$ and feature size $d$. To specify the language and task, we prepend two special tokens to the sequence:

$$\mathbf{X} = \text{concat}(\mathbf{e}_{\text{lang}}, \mathbf{e}_{\text{task}}, \mathbf{X}_{\text{speech}}), \tag{1}$$

where $\text{concat}(\cdot)$ is concatenation along time and $\mathbf{e}_{\text{lang}}, \mathbf{e}_{\text{task}} \in \mathbb{R}^{1 \times d}$ are embeddings of special tokens <lang> and <task>, respectively. $\mathbf{X}$ now has shape $(T+2) \times d$. If the spoken language is known, the true language token will be used as input. Otherwise, a special token <nolang> denoting “unknown language” will be used. During training, we randomly replace the true language with <nolang> with probability 0.5 so that either can be used for inference. The task token is <asr> for speech recognition and <st_lang> for translation to a target language.
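The token logic described above can be sketched as follows. The concrete token spellings such as `<eng>` and `<st_deu>`, and the helper name `build_prefix`, are illustrative assumptions rather than the released implementation.

```python
import random
from typing import List, Optional


def build_prefix(
    language: Optional[str],
    task: str,
    target_language: Optional[str] = None,
    training: bool = False,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return the two special tokens placed in front of the speech features."""
    if rng is None:
        rng = random.Random()
    lang_token = "<nolang>" if language is None else f"<{language}>"
    # During training, hide the true language half of the time so that
    # inference works with or without a known language.
    if training and rng.random() < 0.5:
        lang_token = "<nolang>"
    if task == "asr":
        task_token = "<asr>"
    elif task == "st" and target_language is not None:
        task_token = f"<st_{target_language}>"
    else:
        raise ValueError("task must be 'asr', or 'st' with a target language")
    return [lang_token, task_token]


T = 5  # number of downsampled speech frames in this toy example
prefix = build_prefix("eng", "st", target_language="deu")
print(prefix, "sequence length:", len(prefix) + T)  # length is T + 2
```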

Next, we add sinusoidal positional embeddings to $\mathbf{X}$ and apply a stack of $N$ encoder layers:

$$\mathbf{X}^{(0)} = \mathbf{X} + \text{PosEmb}(\mathbf{X}), \tag{2}$$

$$\mathbf{X}^{(l)} = \text{SpeechEnc}^{(l)}(\mathbf{X}^{(l-1)}), \tag{3}$$

where $l$ is a layer index from 1 to $N$, $\text{PosEmb}(\cdot)$ generates positional embeddings, and $\text{SpeechEnc}^{(l)}(\cdot)$ is the $l$-th encoder layer. The encoder is E-Branchformer (Kim et al., 2023), an enhanced version of Branchformer (Peng et al., 2022), which shows excellent performance across a wide range of benchmarks (Peng et al., 2023b).
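For readers unfamiliar with Equation 2, the sketch below uses the standard sinusoidal formulation of Vaswani et al. (2017). Whether OWSM-CTC applies exactly this variant (for example, with or without input scaling) is an assumption; the function names are hypothetical.

```python
import math
from typing import List


def sinusoidal_positions(length: int, dim: int) -> List[List[float]]:
    """Standard sinusoidal table: sine on even indices, cosine on odd ones."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(x: List[List[float]]) -> List[List[float]]:
    """X^(0) = X + PosEmb(X) for a sequence of shape (T + 2) x d."""
    pos = sinusoidal_positions(len(x), len(x[0]))
    return [[a + b for a, b in zip(xr, pr)] for xr, pr in zip(x, pos)]


x = [[0.0] * 4 for _ in range(3)]  # toy input: 3 positions, d = 4
x0 = add_positions(x)
print(x0[0])  # position 0 receives [0.0, 1.0, 0.0, 1.0]
```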

We compute the CTC loss using the final encoder output $\mathbf{X}^{(N)}$ and an augmented reference $\mathbf{y}_{\text{task}}$. To create this reference, we simply prepend <lang> and <task> to the original ground-truth text of the desired task. Hence, the model will learn to predict the language token in addition to ASR or ST text tokens. This CTC loss is denoted as follows:

$$\mathcal{L}^{(N)} = -\log P_{\text{CTC}}(\mathbf{y}_{\text{task}} \mid \text{softmax}(\mathbf{X}^{(N)} \mathbf{W}_1)), \tag{4}$$

where $\mathbf{W}_1 \in \mathbb{R}^{d \times V}$ is a linear layer and $V$ is the size of the CTC vocabulary.
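To make the CTC negative log-likelihood with an augmented reference concrete, here is a small pure-Python sketch of the standard CTC forward algorithm. It works in the probability domain, which is only suitable for toy inputs; practical implementations use log space. The vocabulary, token ids, and function name are hypothetical.

```python
import math
from typing import List, Sequence


def ctc_neg_log_likelihood(
    probs: Sequence[Sequence[float]], target: Sequence[int], blank: int = 0
) -> float:
    """-log P_CTC(target | probs) for per-frame posteriors (at least one frame)."""
    # Interleave blanks: [blank, y1, blank, y2, ..., blank].
    ext: List[int] = [blank]
    for tok in target:
        ext += [tok, blank]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for frame in probs[1:]:
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * frame[ext[s]]
        alpha = new
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0.0 else math.inf


# Hypothetical vocabulary; the reference is the text with <lang> and <task> prepended.
vocab = {"<blank>": 0, "<eng>": 1, "<asr>": 2, "hello": 3, "world": 4}
reference = [vocab["<eng>"], vocab["<asr>"], vocab["hello"], vocab["world"]]
uniform = [[0.2] * 5 for _ in range(6)]  # 6 frames, uniform posteriors
print(ctc_neg_log_likelihood(uniform, reference))
```

Since the language and task tokens are part of the reference, the same loss trains language identification and text generation together.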

As discussed in Section 2.3, we apply self-conditioned CTC (Nozaki and Komatsu, 2021) at intermediate layers $\mathcal{S} \subseteq \{1, \ldots, N-1\}$ to alleviate the conditional independence assumption of CTC. For any layer $s \in \mathcal{S}$, Equation 3 is replaced by the following operations:

$$\mathbf{A}^{(s)} = \text{SpeechEnc}^{(s)}(\mathbf{X}^{(s-1)}), \tag{5}$$

$$\mathbf{B}^{(s)} = \text{softmax}(\mathbf{A}^{(s)} \mathbf{W}_1), \tag{6}$$

$$\mathbf{X}^{(s)} = \mathbf{A}^{(s)} + \mathbf{B}^{(s)} \mathbf{W}_2, \tag{7}$$

where $\mathbf{W}_2 \in \mathbb{R}^{V \times d}$ is a linear layer. The intermediate CTC loss at layer $s$ is defined as follows:
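The self-conditioning step in Equations 6 and 7 can be sketched with plain Python lists. The encoder layer of Equation 5 is treated as given (its output `a` is passed in), and the tiny weight matrices are made-up values for illustration only.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m: Matrix) -> Matrix:
    out = []
    for row in m:
        peak = max(row)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_condition(a: Matrix, w1: Matrix, w2: Matrix) -> Tuple[Matrix, Matrix]:
    """Given A^(s), return (X^(s), B^(s)) following Equations 6 and 7."""
    b = softmax_rows(matmul(a, w1))  # Eq. 6: posteriors over the CTC vocabulary
    feedback = matmul(b, w2)         # project posteriors back to size d
    x = [[ai + fi for ai, fi in zip(ra, rf)] for ra, rf in zip(a, feedback)]  # Eq. 7
    return x, b


a = [[1.0, 2.0], [0.5, -1.0]]                # 2 positions, d = 2
w1 = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]     # d x V with V = 3
w2 = [[0.5, 0.0], [0.0, 0.5], [0.25, 0.25]]  # V x d
x_s, b_s = self_condition(a, w1, w2)
print([round(sum(row), 6) for row in b_s])  # each row of B^(s) sums to 1
```

The posteriors `b_s` are used twice: they feed the intermediate CTC loss and, through `w2`, condition the later layers on the intermediate prediction.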

$$\mathcal{L}^{(s)} = -\log P_{\text{CTC}}(\mathbf{y}^{(s)} \mid \mathbf{B}^{(s)}), \tag{8}$$

where $\mathbf{y}^{(s)}$ is the augmented reference at layer $s$. Similar to $\mathbf{y}_{\text{task}}$ in Equation 4, we prepend the language and task tokens to the original ground-truth text. Note that the choice of the reference text depends on the task. If the task for the current input is ASR, we simply use the ASR transcript to create $\mathbf{y}^{(s)}$ for all $s$, which is consistent with conventional ASR models. However, if the task is ST, we empirically find that the model cannot converge if we use the translated text as the reference at all intermediate layers $\mathcal{S}$ (see Appendix B.2 for discussions). Therefore, as shown in Figure 2, we utilize the ASR transcript at the first $N_{\text{ASR}}$ layers and the ST text at the remaining $N_{\text{ST}}$ layers, where $N_{\text{ASR}} + N_{\text{ST}} = |\mathcal{S}| \leq N - 1$. This design mimics a cascaded system that first performs ASR and then ST, but our entire model is optimized jointly and trained from scratch. In other words, the first $N_{\text{ASR}}$ CTC layers always perform ASR regardless of the task token (named “ASR-only CTC”), whereas the other CTC layers are multi-tasking: they can perform ASR or ST according to the task token (named “task-specific” or “task-dependent” CTC).
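The layer-dependent choice of reference can be sketched as below, using the layer indices reported in the experimental setup (intermediate CTC at layers 6, 12, 15, and 21 of a 27-layer encoder, the first three being ASR-only). Which task token is prepended at the ASR-only layers during ST is not stated here; the sketch assumes `<asr>`. Token spellings and the function name are hypothetical.

```python
from typing import Dict, List, Optional

INTER_CTC_LAYERS = [6, 12, 15, 21]  # intermediate CTC layers
ASR_ONLY_LAYERS = {6, 12, 15}       # the first three are ASR-only
FINAL_LAYER = 27                    # last encoder layer, always task-specific


def references_per_layer(
    task: str,
    lang_token: str,
    task_token: str,
    asr_tokens: List[str],
    st_tokens: Optional[List[str]] = None,
) -> Dict[int, List[str]]:
    """Map each CTC layer to its augmented reference sequence."""
    if task == "st" and st_tokens is None:
        raise ValueError("ST requires a translated reference")
    refs: Dict[int, List[str]] = {}
    for layer in INTER_CTC_LAYERS + [FINAL_LAYER]:
        if task == "asr" or layer in ASR_ONLY_LAYERS:
            # ASSUMPTION: ASR-only layers use the <asr> task token.
            refs[layer] = [lang_token, "<asr>"] + asr_tokens
        else:
            refs[layer] = [lang_token, task_token] + list(st_tokens or [])
    return refs


refs = references_per_layer("st", "<eng>", "<st_deu>", ["hello"], ["hallo"])
for layer in sorted(refs):
    print(layer, refs[layer])
```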

The overall training loss is an average of the loss terms defined in Equation 4 and Equation 8:

$$\mathcal{L}_{\text{total}} = \frac{1}{1 + |\mathcal{S}|} \left( \mathcal{L}^{(N)} + \sum_{s \in \mathcal{S}} \mathcal{L}^{(s)} \right). \tag{9}$$
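As a worked instance of Equation 9, consider the configuration reported in the experimental setup, with $N = 27$ encoder layers and intermediate CTC at layers 6, 12, 15, and 21, so that $|\mathcal{S}| = 4$. Under that configuration, the total loss would be a plain average of five CTC terms:

```latex
\mathcal{L}_{\text{total}}
  = \frac{1}{5}\left(
      \mathcal{L}^{(27)} + \mathcal{L}^{(6)} + \mathcal{L}^{(12)}
      + \mathcal{L}^{(15)} + \mathcal{L}^{(21)}
    \right)
```

Every CTC layer therefore contributes with equal weight, whether it is ASR-only or task-specific.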

3.3 Prompt encoder

Whisper-style models generate text conditioned on an optional text prompt (Radford et al., 2023; Peng et al., 2023e, 2024). During training, this prompt is simply the previous sentence in the same audio recording. During inference, it can be provided by the user to potentially adjust the output. For encoder-decoder models like Whisper, the text prompt is a prefix to the autoregressive decoder. For our encoder-only model, we leverage a separate Transformer encoder to process the prompt and inject it to the speech encoder through cross-attention. If no prompt is provided, a special token <na> will be used. Let $\mathbf{X}_{\text{prompt}} \in \mathbb{R}^{T' \times d'}$ be the output of the prompt encoder. We insert a cross-attention layer at a subset of layers $\mathcal{T} \subseteq \{1, \ldots, N\}$ of the speech encoder. For any $t \in \mathcal{T}$, the original $\text{SpeechEnc}^{(t)}(\cdot)$ in Equation 3 or Equation 5 becomes $\text{SpeechEncCA}^{(t)}(\cdot, \cdot)$:

$$\mathbf{D}^{(t)} = \text{SpeechEnc}^{(t)}(\mathbf{X}^{(t-1)}), \tag{10}$$

$$\text{SpeechEncCA}^{(t)}(\mathbf{X}^{(t-1)}, \mathbf{X}_{\text{prompt}}) = \mathbf{D}^{(t)} + \text{CrossAtt}(\mathbf{D}^{(t)}, \mathbf{X}_{\text{prompt}}, \mathbf{X}_{\text{prompt}}), \tag{11}$$

where $\text{CrossAtt}(\cdot, \cdot, \cdot)$ is a cross-attention layer with three arguments: query, key, and value.
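The residual injection in Equation 11 can be illustrated with a deliberately simplified sketch: single-head scaled dot-product attention with no learned projections, and with the speech and prompt features sharing one dimension. The actual model uses multi-head attention and different sizes for $d$ and $d'$, so learned projections would be needed; everything below is an assumption made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def cross_attention(query: Matrix, key: Matrix, value: Matrix) -> Matrix:
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(key[0]))
    out = []
    for q in query:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in key]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append(
            [sum(w * v[j] for w, v in zip(weights, value)) for j in range(len(value[0]))]
        )
    return out


def inject_prompt(d_t: Matrix, x_prompt: Matrix) -> Matrix:
    """Eq. 11: D^(t) + CrossAtt(D^(t), X_prompt, X_prompt)."""
    attended = cross_attention(d_t, x_prompt, x_prompt)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(d_t, attended)]


d_t = [[1.0, 2.0], [3.0, 4.0]]  # output of the speech encoder layer (Eq. 10)
x_prompt = [[10.0, 20.0]]       # a one-token prompt receives all attention weight
print(inject_prompt(d_t, x_prompt))  # [[11.0, 22.0], [13.0, 24.0]]
```

The speech frames act as queries, so the output keeps the speech sequence length regardless of how long the prompt is.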

Our training data is a mixture of public ASR and ST datasets. Some of them provide unsegmented long audio, but the others only release segmented short audio. At training time, if a sample does not have a previous sentence, we use <na>. Otherwise, we use either <na> or the previous sentence as the prompt, each with probability 0.5. Section 4.6 shows that OWSM-CTC can leverage the prompt’s information when necessary.
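The prompt sampling rule above amounts to a few lines. The function name and the example sentence are hypothetical.

```python
import random
from typing import Optional


def choose_prompt(previous_sentence: Optional[str], rng: random.Random) -> str:
    """Pick the training-time text prompt for one sample."""
    if previous_sentence is None:
        return "<na>"  # nothing precedes this sample in the recording
    # Otherwise use the previous sentence or <na>, each with probability 0.5.
    return previous_sentence if rng.random() < 0.5 else "<na>"


rng = random.Random(0)
print(choose_prompt(None, rng))
print([choose_prompt("good morning everyone", rng) for _ in range(4)])
```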

| Model | Params | Time shift | Training data | GPU hours |
|---|---|---|---|---|
| Whisper (encoder-decoder) (Radford et al., 2023) | | | | |
| base | 74M | 20ms | 680k hours | unknown |
| small | 244M | 20ms | 680k hours | unknown |
| medium | 769M | 20ms | 680k hours | unknown |
| large-v2 | 1550M | 20ms | 680k hours | unknown |
| OWSM v3.1 (encoder-decoder) (Peng et al., 2024) | | | | |
| base | 101M | 40ms | 180k hours | 2.3k |
| medium | 1.02B | 40ms | 180k hours | 24.6k |
| OWSM-CTC (ours) | | | | |
| medium | 1.01B | 80ms | 180k hours | 19.2k |

Table 1: Summary of model size, training data, and training cost measured on an NVIDIA A100 GPU (40GB).

4 Experiments

4.1 Experimental setups

Table 1 is a brief summary of model size, training data, and training cost.

Data format. Our training data is prepared using scripts publicly released by OWSM v3.1 (Peng et al., 2024). It is a mixture of more than 25 public ASR and ST corpora covering 151 languages and various translation directions. The total audio duration is 180k hours. To create long-form data, consecutive utterances from the same audio recording are concatenated to a duration of no more than 30 seconds. The input audio to the model is always padded to a fixed length of 30 seconds. Appendix A.1 and Table 11 present the training data statistics. The original Whisper-style data contains the start and end timestamps for each utterance. These timestamp tokens are predicted along with normal text tokens during the autoregressive decoding. In OWSM-CTC, we do not include any explicit timestamps since the time-aligned hypothesis can be obtained by forced alignment if desired.
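One plausible way to build such long-form samples is a greedy pass over the consecutive utterances of a recording, sketched below. The exact procedure in the OWSM data scripts may differ, and the tuple layout and function name are assumptions.

```python
from typing import List, Tuple

Utterance = Tuple[float, float, str]  # (start_sec, end_sec, text)


def group_utterances(utts: List[Utterance], max_sec: float = 30.0) -> List[List[Utterance]]:
    """Greedily merge consecutive utterances while the span stays within max_sec."""
    groups: List[List[Utterance]] = []
    current: List[Utterance] = []
    for utt in utts:
        _, end, _ = utt
        # Close the current group if adding this utterance would exceed the limit.
        if current and end - current[0][0] > max_sec:
            groups.append(current)
            current = []
        current.append(utt)
    if current:
        groups.append(current)
    return groups


recording = [
    (0.0, 10.0, "first"),
    (10.0, 25.0, "second"),
    (25.0, 40.0, "third"),
    (40.0, 50.0, "fourth"),
]
for group in group_utterances(recording):
    print(group[0][0], group[-1][1], [text for _, _, text in group])
```

A single utterance longer than the limit would still form its own group in this sketch, so a real pipeline would need to filter or split such cases before padding to 30 seconds.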

Model architecture. Our speech encoder is a 27-layer E-Branchformer with a hidden size of 1024 and 16 attention heads. Four intermediate layers (6, 12, 15, and 21) are used for self-conditioned CTC. The first three are ASR only, while the others are task-specific. The prompt encoder is a 4-layer Transformer with a hidden size of 512 and 8 attention heads. It is injected into the speech encoder at every third layer. The total model size is 1.01B, which matches the size of the encoder-decoder OWSM v3.1 medium (1.02B). More details about the architecture are in Appendix A.2 (see Table 12).

Implementation. We implement OWSM-CTC in ESPnet (Watanabe et al., 2018) based on PyTorch (Paszke et al., 2019). FlashAttention (Dao et al., 2022) is used to improve training efficiency, but it is not used for inference. The batch size per GPU is 4, and 64 NVIDIA A100 GPUs (40GB) are used with distributed data parallel. The total training time is approximately 300 hours. For optimization, we employ the Adam optimizer (Kingma and Ba, 2015) with the piece-wise linear learning rate schedule (Peng et al., 2024). The peak learning rate is 2e-4. Other training hyperparameters can be found in Appendix A.3 (see Table 13).
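The reported numbers are mutually consistent, as a quick calculation shows. The effective batch size below assumes no gradient accumulation, which is not stated in this section.

```latex
\text{GPU hours} \approx 64 \ \text{GPUs} \times 300 \ \text{hours} = 19{,}200 \approx 19.2\text{k}
\qquad
\text{samples per step} = 4 \times 64 = 256 \quad \text{(assuming no gradient accumulation)}
```

The first figure matches the training cost listed for OWSM-CTC in Table 1.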

Evaluation. We fairly compare our encoder-only OWSM-CTC with the previously released encoder-decoder OWSM v3.1 models (Peng et al., 2024) since they are trained on the same data. We also show the results of Whisper under the same decoding setup for reference, but we note that they are not comparable with ours due to completely different training data. By default, short-form audio without any text prompt is used, but we also evaluate the long-form ASR performance in Section 4.5 and investigate the effect of text prompts in Section 4.6.

| Model | Accuracy % (↑) |
|---|---|
| Whisper (encoder-decoder) (Radford et al., 2023) | |
| base | 47.6 |
| small | 53.1 |
| medium | 54.8 |
| OWSM v3 (encoder-decoder) (Peng et al., 2023e) | |
| medium | 81.4 |
| OWSM v3.1 (encoder-decoder) (Peng et al., 2024) | |
| base | 41.9 |
| medium | 75.6 |
| OWSM-CTC (ours) | |
| medium | 87.6 |

Table 2: Spoken LID results on the FLEURS test set.

4.2 Language identification

Table 2 presents the LID results on the FLEURS test set (Conneau et al., 2023). Our OWSM-CTC achieves a top-1 accuracy of 87.6%, outperforming the other encoder-decoder models by a large margin. This is likely because spoken LID requires a powerful encoder to extract useful information from the input audio. Our encoder-only model is especially suitable for this type of task.

| Model | CommonVoice en | FLEURS en | LibriSpeech test-clean | LibriSpeech test-other | MLS en | Switchboard eval2000 | TEDLIUM | VoxPopuli en | WSJ eval92 | Average WER (↓) | Speed-up (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Whisper (encoder-decoder) (Radford et al., 2023) | | | | | | | | | | | |
| base | 25.2 | 12.4 | 5.1 | 12.0 | 13.4 | 25.7 | 6.3 | 10.2 | 5.0 | 12.8 | 2.40x |
| small | 15.7 | 9.6 | 3.3 | 7.7 | 9.1 | 22.2 | 4.6 | 8.5 | 4.3 | 9.4 | 1.46x |
| medium | 11.9 | 6.4 | 2.8 | 6.5 | 10.2 | 19.4 | 5.1 | 7.6 | 2.9 | 8.1 | 0.76x |
| large-v2 | 10.5 | 6.0 | 4.1 | 6.1 | 7.7 | 24.0 | 6.0 | 7.1 | 3.3 | 8.3 | 0.55x |
| OWSM v3.1 (encoder-decoder) (Peng et al., 2024) | | | | | | | | | | | |
| base | 21.5 | 14.8 | 3.6 | 9.1 | 12.0 | 22.9 | 7.8 | 12.0 | 5.3 | 12.1 | 2.97x |
| medium | 12.6 | 9.0 | 2.4 | 5.0 | 7.1 | 16.3 | 5.1 | 8.4 | 3.5 | 7.7 | 1.00x |
| + beam 5 | 11.7 | 8.5 | 2.7 | 5.3 | 6.6 | 15.5 | 5.1 | 8.5 | 3.4 | 7.5 | 0.06x |
| OWSM-CTC (ours) | | | | | | | | | | | |
| medium | 12.1 | 9.9 | 2.4 | 5.2 | 7.3 | 16.9 | 4.9 | 8.6 | 4.2 | 7.9 | 3.63x |

Table 3: WER % (↓) of English ASR. Speed-up (↑) is based on average decoding time. Whisper is trained on 438k hours of English audio, whereas OWSM v3.1 and our OWSM-CTC are trained on only 73k hours. Results of Whisper large-v2 and OWSM v3.1 medium with beam search are shown in gray, which are not comparable with the others due to different model sizes or decoding configurations. Bold: the best result. Underlined: OWSM-CTC outperforms OWSM v3.1 medium.

| Model | MLS es | MLS fr | MLS de | MLS nl | MLS it | MLS pt | MLS pl | AISHELL-1 (zh) | KsponSpeech clean (ko) | KsponSpeech other (ko) | ReazonSpeech (ja) | Average Error Rate (↓) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| data size | 11.1 | 9.8 | 13.3 | 2.1 | 2.6 | 8.6 | 4.3 | 23.4 | 8.0 | 8.0 | 7.1 | |
| Whisper (encoder-decoder) (Radford et al., 2023) | | | | | | | | | | | | |
| base | 14.5 | 25.2 | 19.9 | 30.9 | 32.9 | 23.5 | 25.2 | 39.1 | 27.0 | 22.9 | 54.1 | 28.7 |
| small | 9.1 | 13.6 | 11.5 | 18.2 | 21.3 | 13.8 | 12.5 | 25.1 | 24.0 | 15.4 | 32.5 | 17.9 |
| medium | 6.1 | 9.7 | 8.1 | 12.2 | 15.6 | 8.9 | 6.8 | 15.7 | 17.6 | 12.8 | 25.3 | 12.6 |
| large-v2 | 4.8 | 7.0 | 6.3 | 9.7 | 13.2 | 6.6 | 5.5 | 18.3 | 20.0 | 13.1 | 26.8 | 11.9 |
| data size | 2.0 | 2.5 | 3.7 | 1.7 | 0.7 | 0.3 | 0.3 | 16.3 | 1.0 | 1.0 | 18.9 | |
| OWSM v3.1 (encoder-decoder) (Peng et al., 2024) | | | | | | | | | | | | |
| base | 18.5 | 24.2 | 18.7 | 28.6 | 33.7 | 44.9 | 49.7 | 12.2 | 23.8 | 26.1 | 11.2 | 26.5 |
| medium | 9.0 | 12.1 | 10.8 | 18.1 | 20.2 | 21.6 | 25.2 | 6.4 | 16.7 | 18.9 | 7.9 | 15.2 |
| + beam 5 | 8.6 | 11.2 | 10.2 | 17.2 | 19.1 | 19.4 | 23.4 | 5.9 | 15.0 | 17.0 | 7.8 | 14.1 |
| OWSM-CTC (ours) | | | | | | | | | | | | |
| medium | 10.3 | 12.9 | 11.9 | 20.4 | 22.1 | 23.5 | 31.6 | 6.4 | 14.8 | 16.5 | 8.1 | 16.2 |

Table 4: Multilingual ASR results. CER % (↓) is shown for Chinese (zh), Korean (ko) and Japanese (ja), while WER % (↓) is shown for the others. Data sizes are in thousand hours. Results of Whisper large-v2 and OWSM v3.1 medium with beam search are shown in gray, which are not comparable with the others. Bold: the best result. Underlined: OWSM-CTC outperforms OWSM v3.1 medium.

4.3 Speech recognition

Table 3 presents word error rates (WERs) on nine English ASR test sets. Following Peng et al. (2023e, 2024), we leverage greedy decoding and apply the Whisper English text normalizer before scoring. (We also report the results of Whisper large-v2 and OWSM v3.1 medium with beam search in gray for reference, but they are not comparable with the others due to different model sizes or decoding configurations. This applies to other tables as well.) We record the average decoding time across all English test sets on an NVIDIA A40 GPU and calculate the relative speed-up. Results show that our non-autoregressive OWSM-CTC generally has comparable WERs with the autoregressive OWSM v3.1 medium (average: 7.9 vs. 7.7), both of which have 1B parameters. However, OWSM-CTC achieves 3.63x speed-up due to parallel decoding. Notably, OWSM-CTC is even faster than OWSM v3.1 base, which has only 100M parameters, and our WERs are much lower (average: 7.9 vs. 12.1). Compared to Whisper models trained on significantly more data, our OWSM-CTC is still competitive in many cases, and our inference is much faster. These results demonstrate that OWSM-CTC achieves an excellent trade-off between recognition accuracy and inference efficiency.
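For reference, the two quantities reported in Table 3 can be computed as in the sketch below. It scores a single utterance with a word-level edit distance and omits text normalization; corpus-level scoring aggregates errors over all utterances. The decoding times in the example are made-up numbers chosen only to reproduce a 3.63x ratio, and the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_speed_up(baseline_seconds: float, model_seconds: float) -> float:
    """Speed-up relative to a baseline whose own speed-up is defined as 1.00x."""
    return baseline_seconds / model_seconds


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {100 * wer:.1f}%")  # one substitution and one deletion out of six words
print(f"speed-up: {relative_speed_up(363.0, 100.0):.2f}x")
```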

Table 4 shows the results of multilingual ASR. We perform greedy decoding and apply the Whisper basic text normalizer before scoring. Our OWSM-CTC is slightly worse than OWSM v3.1 medium in terms of the average WER/CER (16.2 vs. 15.2). For European languages in MLS (Pratap et al., 2020), OWSM-CTC generally falls behind, but for East Asian languages like Chinese, Japanese, and Korean, OWSM-CTC is on par with or better than OWSM v3.1 medium. This difference might be related to the training data size and tokenization.
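For readers unfamiliar with the two metrics, the sketch below computes WER and CER from a Levenshtein distance. It is an illustration only: the function names are hypothetical, text normalization (the Whisper normalizers mentioned above) is deliberately left out, and removing spaces before computing CER is an assumption rather than a documented detail of our scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match when equal)
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Return WER (unit='word') or CER (unit='char') in percent."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        # Assumption: spaces are ignored for character-level scoring.
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(round(error_rate("the cat sat", "the bat sat"), 1))        # one substitution in three words
print(round(error_rate("你好吗", "你好", unit="char"), 1))          # one deletion in three characters
```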

Src Lang. de es fr ca Ave. ($\uparrow$) Speed-up ($\uparrow$)
data size 4.3 6.7 4.5 0.2
Whisper (encoder-decoder) (Radford et al., 2023)
base 11.0 18.9 13.2 9.9 13.3 1.84x
small 23.9 31.8 26.1 21.4 25.8 1.54x
medium 32.0 37.3 33.4 28.8 32.9 0.84x
large-v2 35.2 39.7 35.7 31.2 35.5 0.48x
data size 0.2 0.1 0.3 0.1
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base 7.1 10.3 11.5 9.4 9.6 2.78x
medium 16.7 22.3 22.8 18.8 20.2 1.00x
+ beam 5 18.2 24.5 24.4 21.1 22.1 0.05x
OWSM-CTC (ours)
medium 20.7 27.9 27.5 24.2 25.1 3.35x
Table 5: BLEU ($\uparrow$) of X-to-En ST on CoVoST-2. Data sizes are in thousand hours. Results of Whisper large-v2 and OWSM v3.1 medium with beam search are shown in gray, which are not comparable with the others. Bold: the best result. Underlined: OWSM-CTC outperforms OWSM v3.1 medium.
Tgt Lang. de ca zh fa et mn tr ar sv lv sl ta ja id cy Ave. ($\uparrow$) Speed-up ($\uparrow$)
data size 14.0 0.4 13.7 0.8 0.4 0.4 0.9 0.9 0.4 0.4 0.4 0.4 1.0 0.4 0.4 - -
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base 15.8 8.3 13.0 3.3 3.1 1.6 2.0 1.7 8.7 2.3 1.3 0.0 10.6 6.1 5.0 5.5 2.39x
medium 26.3 20.4 29.7 10.2 9.6 5.8 7.8 7.2 20.8 8.4 11.0 0.1 21.1 17.2 16.3 14.1 1.00x
+ beam 5 27.3 22.5 31.3 11.1 11.1 6.9 9.1 8.4 22.3 9.9 12.7 0.1 22.3 19.7 17.9 15.5 0.05x
OWSM-CTC (ours)
medium 26.7 24.0 32.9 9.9 11.4 6.2 7.9 8.3 24.5 10.0 14.2 0.1 20.4 22.6 20.6 16.0 4.20x
p-value 0.006 0.001 0.001 0.001 0.001 0.001 0.145 0.001 0.001 0.001 0.001 0.031 0.001 0.001 0.001 - -
Table 6: BLEU ($\uparrow$) of En-to-X ST on CoVoST-2. Data sizes are in thousand hours. Note that Whisper does not support En-to-X translation. The p-values are computed by comparing OWSM-CTC against OWSM v3.1 medium using the Paired Significance Test in SacreBLEU (Post, 2018). Results of OWSM v3.1 medium with beam search are shown in gray, which are not comparable with the others.

4.4 Speech translation

We evaluate ST on CoVoST-2 test sets (Wang et al., 2020a). By default, we perform greedy decoding and calculate BLEU scores in true case with punctuation. (Footnote 4: Results in lowercase without punctuation can be found in Appendix C, which are consistent with previous OWSM work (Peng et al., 2024).) For X-to-En translation, we follow OWSM v3.1 (Peng et al., 2024) and report results for directions where the training data size is over 100 hours. For the other low-resource directions, neither OWSM v3.1 nor our OWSM-CTC works well in general. For En-to-X translation, we report all 15 directions. We calculate the speed-up based on the average decoding time on an NVIDIA A40 GPU.

Table 5 shows the X-to-En results. Notably, our encoder-only OWSM-CTC consistently outperforms the encoder-decoder OWSM v3.1 by a large margin. The average BLEU score improves from 20.2 to 25.1 (a 24% relative improvement). We also achieve a 3.35x speed-up for inference.

Table 6 presents En-to-X results. Whisper does not support these directions. Our OWSM-CTC outperforms OWSM v3.1 medium in 12 of 15 translation directions, and most of the differences are statistically significant. The average BLEU improves from 14.1 to 16.0 (a 13% relative improvement), and the inference speed-up is 4.20x.
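The relative improvements and the win count quoted above can be reproduced directly from the numbers in Tables 5 and 6. The short check below is only a sanity calculation; the helper name `relative_improvement` is introduced here for illustration.

```python
def relative_improvement(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Average BLEU values taken from Table 5 (X-to-En) and Table 6 (En-to-X).
x_to_en = relative_improvement(20.2, 25.1)
en_to_x = relative_improvement(14.1, 16.0)

# Per-direction BLEU from Table 6 (de, ca, zh, fa, et, mn, tr, ar, sv, lv, sl, ta, ja, id, cy).
owsm_v31_medium = [26.3, 20.4, 29.7, 10.2, 9.6, 5.8, 7.8, 7.2, 20.8, 8.4, 11.0, 0.1, 21.1, 17.2, 16.3]
owsm_ctc = [26.7, 24.0, 32.9, 9.9, 11.4, 6.2, 7.9, 8.3, 24.5, 10.0, 14.2, 0.1, 20.4, 22.6, 20.6]
wins = sum(ctc > base for base, ctc in zip(owsm_v31_medium, owsm_ctc))

print(f"X-to-En: {x_to_en:.1f}%  En-to-X: {en_to_x:.1f}%  wins: {wins}/15")
```

The two directions where OWSM-CTC is lower are fa and ja, and ta is a tie at 0.1 BLEU.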

We make the following observations from the ST results: (1) Our non-autoregressive OWSM-CTC generally achieves a 3 to 4 times speed-up over the encoder-decoder baseline, which is consistent with ASR. (2) OWSM-CTC also improves ST performance, sometimes by a large margin. One reason is that the autoregressive model suffers from hallucination and error propagation, while the non-autoregressive model is more stable. (3) The BLEU improvement on X-to-En is larger than that on En-to-X, likely because: (i) the OWSM training set contains a large amount of English ASR data, so OWSM-CTC might acquire a strong capability for generating English text; (ii) X-to-En has less training data than En-to-X, and the encoder-decoder model may need a sufficient amount of training data to achieve good translation performance.

Our findings reveal that large-scale CTC-based models are also promising for ST in various language pairs, which is consistent with prior investigations at smaller scales (Yan et al., 2023).

Model Context Length WER % ($\downarrow$) Speed-up ($\uparrow$)
Whisper (encoder-decoder) (Radford et al., 2023)
base - 5.3 1.40x
small - 4.4 1.62x
medium - 3.8 0.86x
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base - 9.6 1.40x
medium - 5.7 1.00x
OWSM-CTC (ours)
medium 2s 5.4 22.40x
medium 4s 5.2 19.35x
medium 6s 5.2 16.07x
medium 8s 5.2 12.09x
Table 7: Long-form ASR results on the TEDLIUM (Hernandez et al., 2018) test set which consists of 11 audio recordings ranging from 6 to 27 minutes. Bold: the best result. Underlined: OWSM-CTC outperforms OWSM v3.1 medium.

4.5 Long-form speech recognition

For long-form ASR, a model takes as input an unsegmented audio recording of arbitrary length and generates the entire transcription without explicit voice activity detection. Whisper and encoder-decoder OWSM can predict the start and end timestamps of each utterance within a fixed-length segment. Those timestamps are used to shift the recognition window for chunk-wise long-form ASR. However, this chunk-wise recognition is a sequential process because the location of the next chunk depends on the predicted timestamp in the current chunk. (Footnote 5: The decoding process might be parallelized if token-level timestamps are available. However, it remains an open problem to derive accurate token-level timestamps from an attention-based encoder-decoder model without extra training.) By contrast, our OWSM-CTC performs chunk-wise recognition in a fully parallel manner. We first split the entire audio into overlapped chunks of 30s, where the overlapped region serves as the left and right context. (Footnote 6: We follow this tutorial for long-form ASR with CTC: https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Streaming_ASR.ipynb) We then perform CTC greedy decoding on batched chunks. The batch size is 32 on a single NVIDIA A40 GPU (48GB). Table 7 shows the WER and speed-up with different context lengths. Our OWSM-CTC achieves lower WERs than the encoder-decoder OWSM v3.1, while being approximately 20 times faster due to the batched parallel decoding. OWSM-CTC is also robust to different context lengths. These observations indicate that CTC-based non-autoregressive models perform very well for long-form ASR, which is consistent with prior findings (Koluguri et al., 2023).
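The chunking step can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual decoding script: we assume each 30s chunk consists of a central region plus `context_sec` of context on each side, that only the central region's output is kept, and that chunks at the recording boundaries are simply truncated. The function name `plan_chunks` and the tuple layout are hypothetical.

```python
def plan_chunks(total_sec, chunk_sec=30.0, context_sec=4.0):
    """Plan overlapped chunks for parallel chunk-wise decoding.

    Returns a list of (start, end, keep_start, keep_end) in seconds:
    the model sees [start, end], but only the output aligned with
    [keep_start, keep_end] is kept when stitching the transcript.
    """
    hop = chunk_sec - 2 * context_sec  # length of the kept central region
    if hop <= 0:
        raise ValueError("context is too long for the chunk size")
    chunks = []
    keep_start = 0.0
    while keep_start < total_sec:
        keep_end = min(keep_start + hop, total_sec)
        start = max(0.0, keep_start - context_sec)   # left context
        end = min(total_sec, keep_end + context_sec)  # right context
        chunks.append((start, end, keep_start, keep_end))
        keep_start = keep_end
    return chunks


# All chunks are known up front, so they can be batched and decoded in parallel.
for chunk in plan_chunks(60.0, chunk_sec=30.0, context_sec=4.0):
    print(chunk)
```

Because the chunk boundaries do not depend on any model prediction, the whole recording can be batched at once, which is the source of the speed-up reported in Table 7.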

GigaSpeech LS-clean LS-other SWBD TEDLIUM AISHELL
w/o prev 11.80 2.42 5.22 16.92 4.95 6.37
w/ prev 11.23 2.38 5.10 16.70 4.55 6.25
p-value <0.001 0.19 0.007 <0.001 <0.001 <0.001
Table 8: Using the previous sentence as a text prompt improves the ASR WER/CER of OWSM-CTC.

4.6 Effect of text prompt

As described in Figure 2 and Section 3.3, OWSM-CTC can take an additional text prompt as input, which might change the output. During training, the prompt is either a special token <na> or the previous sentence in the same audio, each chosen with a probability of 0.5, which follows the setup of Whisper and OWSM. To verify that OWSM-CTC can utilize information from the prompt when necessary, we perform greedy decoding on several test sets with the previous sentence in the dataset as a prompt. As shown in Table 8, using the previous sentence reduces the error rates. The p-values are computed using the Matched Pair Sentence Segment method. (Footnote 7: https://github.com/usnistgov/SCTK) Appendix D provides an example where the previous sentence also affects the output text style.
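The prompt selection during training can be pictured with the small sketch below. It only illustrates the 0.5 probability described above; the function name `sample_prompt` and the fallback to <na> when no previous sentence exists are assumptions, not a description of the training code.

```python
import random

NA_TOKEN = "<na>"  # special token used when no text prompt is provided


def sample_prompt(previous_sentence, rng, p_use_previous=0.5):
    """Pick the training prompt: the previous sentence or the <na> token.

    Assumption: if the utterance has no previous sentence, <na> is always used.
    """
    if previous_sentence is not None and rng.random() < p_use_previous:
        return previous_sentence
    return NA_TOKEN


rng = random.Random(0)
draws = [sample_prompt("thank you very much", rng) for _ in range(10000)]
print(draws.count(NA_TOKEN) / len(draws))  # close to 0.5
```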

Input length 5s 10s 20s
Whisper (encoder-decoder) (Radford et al., 2023)
large-v3 Fjell Fusilet Rekordverk
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
medium thank you thank you (Applause)
OWSM-CTC (ours)
medium . ( ( )
Table 9: ASR outputs with random noise as input.

4.7 Robustness

To investigate the robustness, we first consider random noise as input. Table 9 shows the ASR outputs generated by three models. Encoder-decoder models, including Whisper and OWSM v3.1, tend to generate some texts that look meaningful, while our OWSM-CTC generates fewer tokens, which are mostly punctuation marks that do not actually have meaning.

Another typical issue of autoregressive decoding is that the generation might fall into repetitions of a few characters or words. Table 19 in Appendix E presents two examples from ASR and ST, respectively. Our non-autoregressive model is more robust in such cases. To quantitatively measure this type of error, we consider a hypothesis as a failure if it contains any character-level $\theta$-gram ($\theta = 1, 2, \dots, \theta_{\text{max}}$) that consecutively occurs for at least $\delta$ times. Table 10 shows the number of failures in all ST test sets with different thresholds. We can see that the encoder-decoder OWSM v3.1 medium fails many times even with beam search, while our OWSM-CTC has almost no failures.
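One possible way to implement this failure criterion is with a regular-expression backreference, as sketched below. This is an illustrative reimplementation rather than the script used to produce Table 10; in particular, treating whitespace as an ordinary character is an assumption, and the function name is hypothetical.

```python
import re


def is_repetition_failure(text, theta_max=10, delta=5):
    """Return True if any character-level n-gram (1 <= n <= theta_max)
    occurs at least `delta` times in a row somewhere in `text`."""
    for theta in range(1, theta_max + 1):
        # (.{theta}) captures one n-gram; \1{delta-1,} requires it to
        # repeat immediately at least delta-1 more times.
        pattern = r"(.{%d})\1{%d,}" % (theta, delta - 1)
        if re.search(pattern, text, flags=re.DOTALL):
            return True
    return False


print(is_repetition_failure("ha ha ha ha ha ha"))                    # True: "ha " repeats 5 times
print(is_repetition_failure("which leads to rerererererere"))        # True: "re" repeats
print(is_repetition_failure("in search of the mythical treasure"))   # False
```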

$\theta_{\text{max}}$ $\delta$ Model #Failures ($\downarrow$)
10 5 OWSM v3.1 2448
10 5 OWSM v3.1 (beam 5) 630
10 5 OWSM-CTC (ours) 3
20 5 OWSM v3.1 2537
20 5 OWSM v3.1 (beam 5) 672
20 5 OWSM-CTC (ours) 3
20 6 OWSM v3.1 1985
20 6 OWSM v3.1 (beam 5) 453
20 6 OWSM-CTC (ours) 1
Table 10: Comparison of the number of decoding failures in all ST test sets. There are 286k samples in total.

5 Conclusion

We propose OWSM-CTC, a novel encoder-only speech foundation model built upon 180k hours of public audio data and open-source toolkits. OWSM-CTC employs multi-task self-conditioned CTC for multilingual ASR, any-to-any ST, and LID. We conduct extensive experiments to compare OWSM-CTC with the encoder-decoder OWSM models trained on the same data. We find that OWSM-CTC achieves competitive performance on ASR and superior performance on ST for both X-to-En (24% relative improvement) and En-to-X (13% relative improvement), while being more robust and 3 to 4 times faster at inference time. Additionally, OWSM-CTC improves the long-form ASR WER with 20 times faster inference due to the batched parallel decoding. OWSM-CTC also outperforms the baselines on LID. To promote open research on large speech models, we will publicly release our code, pre-trained model weights and training logs.

Limitations

Although OWSM-CTC reduces the training cost by 22% compared to OWSM v3.1, it still requires nearly 20k GPU hours, which is nontrivial. OWSM-CTC can generate incorrect ASR or ST outputs due to limited training data in certain languages. Care should be taken when using our model for low-resource ASR or ST. Besides, we have only evaluated our model with greedy decoding as it has the fastest inference speed. The non-autoregressive model sometimes makes mistakes in spelling or grammar due to a lack of language models.

Broader Impacts and Ethics

Our OWSM-CTC is a novel encoder-only speech foundation model built upon public datasets and open-source toolkits. Compared to other popular choices, it achieves very strong performance and efficiency. We adhere to the ACL ethics policy and there is no violation of privacy in our experiments. We plan to publicly release all scripts, pre-trained models, and training logs, which can promote transparency and open science. We believe this will benefit the entire speech research community and it can make the latest speech technology available to a broader range of people all over the world.

Acknowledgements

Our computing resources are supported by PSC Bridges2 and NCSA Delta via ACCESS allocation CIS210014, under National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

References

Appendix A Details of Experimental Setups

Model Unlabeled English ASR Other ASR ST Languages Vocabulary Size
Whisper (Radford et al., 2023)
   Initial versions - 438k hours 117k hours 125k hours 99 52k
   large-v3 4M hours 1M hours of labeled data in total 100 52k
OWSM v3.1 (Peng et al., 2024)
- 73k hours 67k hours 40k hours 151 50k
OWSM-CTC (ours)
- 73k hours 67k hours 40k hours 151 50k
Table 11: Details of training data. Our data is prepared using the scripts released by OWSM v3.1 (Peng et al., 2024).
Model Params Encoder Decoder Layers Hidden Size Attention Heads Time Shift
Whisper (Radford et al., 2023)
   tiny 39M Transformer Transformer 4 384 6 20ms
   base 74M Transformer Transformer 6 512 8 20ms
   small 244M Transformer Transformer 12 768 12 20ms
   medium 769M Transformer Transformer 24 1024 16 20ms
   large 1.55B Transformer Transformer 32 1280 20 20ms
   large-v3 1.55B Transformer Transformer 32 1280 20 20ms
OWSM v3.1 (Peng et al., 2024)
   base 101M E-Branchformer Transformer 6 384 6 40ms
   medium 1.02B E-Branchformer Transformer 18 1024 16 40ms
OWSM-CTC (ours)
   medium 1.01B E-Branchformer - 27 1024 16 80ms
Table 12: Details of model architectures. Whisper (Radford et al., 2023) and OWSM v3.1 (Peng et al., 2024) are encoder-decoder models, whereas our OWSM-CTC is an encoder-only model. We mostly follow the design of OWSM v3.1 medium, but we increase the number of encoder layers to match the overall model size.
Model Batch Size Total Steps Warmup Steps Max Learning Rate InterCTC Layers $\mathcal{S}$
OWSM v3.1 (Peng et al., 2024)
   base 256 675k 60k 1e-3 -
   medium 256 675k 60k 2e-4 -
OWSM-CTC (ours)
   medium 256 600k 60k 2e-4 6, 12, 15, 21
Table 13: Training hyperparameters. We mostly follow the training setups of OWSM v3.1 medium (Peng et al., 2024). As described in Section 3.2, we employ self-conditioned CTC at four intermediate layers.
Downsampling Strategy Params GPU VRAM ($\downarrow$) Speed-up ($\uparrow$) ASR WER ($\downarrow$) ST BLEU ($\uparrow$)
4x in CNN 55M 38GB 1.00x 8.3 22.0
6x in CNN 55M 22GB 1.12x 8.6 21.3
8x in CNN 55M 19GB 1.13x 8.8 21.5
4x in CNN + 2x in the middle of Encoder 55M 38GB 1.03x 9.7 21.6
Table 14: Comparison of different downsampling strategies on MuST-C v2 En-De. The other configurations, such as batch size, are kept the same. Using 4x downsampling achieves the best ASR and ST results, while using 8x downsampling significantly reduces the GPU memory usage, which enables a larger batch size per GPU. We employ 8x downsampling in our large-scale OWSM-CTC to reduce training costs.
ASR-Only CTC Layers Task-Dependent CTC Layers ASR WER ($\downarrow$) ST BLEU ($\uparrow$)
- 6, 12, 18, 24 diverged
6 12, 18, 24 9.0 21.6
6, 12 18, 24 8.8 21.5
6, 12, 18 24 8.4 21.2
Table 15: Effect of the CTC type. This small-scale model has 24 layers with 8x downsampling in CNN. As described in Section 3.2, we employ self-conditioned CTC at some intermediate layers. These CTC layers can perform a single task like ASR or multiple tasks depending on the task specifier. If we allow all CTC layers to perform multiple tasks (ASR and ST), the model cannot converge from scratch. Therefore, we leverage the first few CTC layers for ASR only and the remaining ones for multi-tasking.
Src Lang. de es fr ca Average ($\uparrow$) Speed-up ($\uparrow$)
data size 4.3 6.7 4.5 0.2
Whisper (encoder-decoder) (Radford et al., 2023)
base 11.4 19.2 13.1 9.7 13.4 1.84x
small 25.0 32.8 26.4 21.7 26.5 1.54x
medium 33.6 39.7 34.4 29.2 34.2 0.84x
data size 0.2 0.1 0.3 0.1
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base 7.3 10.0 11.1 9.0 9.4 2.78x
medium 17.1 22.3 22.7 18.4 20.1 1.00x
OWSM-CTC (ours)
medium 21.1 28.2 27.7 23.7 25.2 3.35x
Table 16: BLEU ($\uparrow$) of X-to-En ST on CoVoST-2 using lowercase without punctuation. Data sizes are in thousand hours. Bold: the best result. Underlined: our OWSM-CTC outperforms OWSM v3.1 medium.
Tgt Lang. de ca zh fa et mn tr ar sv lv sl ta ja id cy Average ($\uparrow$) Speed-up ($\uparrow$)
data size 14.0 0.4 13.7 0.8 0.4 0.4 0.9 0.9 0.4 0.4 0.4 0.4 1.0 0.4 0.4
OWSM v3.1 (encoder-decoder) (Peng et al., 2024)
base 14.6 7.7 14.5 3.0 1.8 1.0 1.2 1.6 8.1 1.3 0.7 0.0 8.7 5.1 4.5 4.9 2.39x
medium 25.4 19.6 32.1 10.1 7.7 4.6 6.5 7.2 20.3 6.4 9.0 0.0 19.6 16.1 15.3 13.3 1.00x
OWSM-CTC (ours)
medium 25.5 23.0 35.1 10.0 9.2 4.8 6.8 8.2 23.8 7.7 12.0 0.0 18.5 21.0 19.4 15.0 4.20x
Table 17: BLEU ($\uparrow$) of En-to-X ST on CoVoST-2 using lowercase without punctuation. Data sizes are in thousand hours. Bold: the best result. Underlined: our OWSM-CTC outperforms OWSM v3.1 medium. Note that Whisper does not support En-to-X translation.
Input audio content: future ’s over here wind sun a new energy grid new investments to create high paying jobs repower america it ’s time to get real there is an old african proverb that says if you want to go quickly go alone if you want to go far go together we need to go far quickly thank you very much
Previous sentence: with one hundred percent clean electricity within ten years a plan to put america back to work make us more secure and help stop global warming finally a solution that ’s big enough to solve our problems repower america find out more this is the last one it ’s about repowering america one of the fastest ways to cut our dependence on old dirty fuels that are killing our planet
ASR w/o previous: Future’s over here. Wind, sun. A new energy grid. New investments to create high-pan jobs. Repower America. It’s time to get real. There’s an old African proverb that says, "If you want to go quickly, go alone. if you want to go far, go together." We need to go far quickly. Thank you very much. (Applause)
ASR w/ previous: future ’s over here wind sun a new energy grid new investments to create high pan jobsrepower america it ’s time to get real there ’s an old african proverb that says if you want to go quickly go alone if you want to go far go together we need to go far quickly thank you very much
Table 18: Using a previous sentence as the prompt might change the output style. The optional prompt encoder is defined in Figure 2 and Section 3.3.
Row 1 (ASR, MLS En)
Groundtruth reference: in search of the mythical treasure your grandfather is supposed to have secreted there he laughed and the girl instinctively shuddered with a newborn distrust there was no mirth in the sound
OWSM v3.1 output: in search of the mythical treasure your grandfather is supposed to have secreted there ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha …
OWSM-CTC output (ours): in search of the mythical treasure your grandfather is supposed to have secreted there he laughed and the girl instinctively shuddered with a new-born distrust there was no mirth in the sound
Row 2 (ST, CoVoST-2 Es-En)
Groundtruth reference: and with her they began a national tour that took them all around the country
OWSM v3.1 output: they take a national gira which leads to rerererererererererererererere …
OWSM-CTC output (ours): with learn a national tour that leads them to run the entire country
Table 19: Autoregressive decoding sometimes gets trapped in a loop in both ASR (row 1, MLS En) and ST (row 2, CoVoST-2 Es-En). Our OWSM-CTC is more robust.

A.1 Training data

Table 11 summarizes the training data statistics. We prepare the training data mixture using the scripts publicly released by OWSM v3.1 (Peng et al., 2024). This ensures a fair comparison between our OWSM-CTC and the previously released encoder-decoder OWSM models.

Our use of the data is consistent with their intended use. These datasets have been widely used in speech research. They do not violate the privacy of creators or users, nor do they contain any offensive content. Specifically, the individual training datasets and licenses are listed below: AIDATATANG (CC BY-NC-ND 4.0) (Footnote 8: https://www.openslr.org/62/), AISHELL-1 (Apache 2.0) Bu et al. (2017), AMI (CC BY 4.0) Carletta (2007), Babel (Footnote 9: https://www.iarpa.gov/research-programs/babel), CommonVoice (CC0-1.0) Ardila et al. (2020), CoVoST2 (CC BY-NC 4.0) Wang et al. (2020a), Fisher Switchboard (LDC) Godfrey et al. (1992), Fisher Callhome Spanish (LDC) Post et al. (2013), FLEURS (CC-BY-4.0) Conneau et al. (2023), Googlei18n (Footnote 10: Resources 32, 35, 36, 37, 41, 42, 43, 44, 52, 53, 54, 61, 63, 64, 65, 66, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, and 86 from openslr.org.), GigaSpeech (Apache 2.0) Chen et al. (2021), GigaST (CC BY-NC 4.0) Ye et al. (2022), KsponSpeech (MIT License) Bang et al. (2020), LibriSpeech (CC BY 4.0) Panayotov et al. (2015), Multilingual LibriSpeech (CC BY 4.0) Pratap et al. (2020), MagicData (CC BY-NC-ND 4.0) (Footnote 11: https://openslr.org/68/), MuST-C (CC BY NC ND 4.0 International) Cattoni et al. (2021), SPGISpeech O’Neill et al. (2021), TEDLIUM3 (CC BY-NC-ND 3.0) Hernandez et al. (2018), ReazonSpeech (Apache 2.0 / CDLA-Sharing-1.0) Yin et al. (2023), Russian OpenSTT (CC-BY-NC) (Footnote 12: https://github.com/snakers4/open_stt), VCTK (CC BY 4.0) (Footnote 13: https://huggingface.co/datasets/vctk), VoxForge (GPL) (Footnote 14: https://www.voxforge.org/), VoxPopuli (Attribution-NonCommercial 4.0 International) Wang et al. (2021), WenetSpeech (Creative Commons Attribution 4.0 International License) Zhang et al. (2022).

A.2 Model architectures

Table 12 shows the model configurations. Our OWSM-CTC mostly follows the design of OWSM v3.1 medium (Peng et al., 2024), but we only use an encoder. To match the total model size, we increase the number of layers to 27, leading to a total of 1B parameters. Note that the sequence length of the encoder is usually longer than that of the decoder. Hence, the encoder-only model can have a higher computational cost than the encoder-decoder model. To alleviate this issue, we apply a larger downsampling rate in the CNN module to reduce the sequence length. Our final time shift is 80ms, as opposed to 40ms of the encoder-decoder OWSM models. We observe that our training time for a fixed number of updates is roughly the same as that of OWSM v3.1 medium. We also investigated different downsampling strategies at a smaller scale, as discussed in Appendix B.1 and Table 14.
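The effect of the downsampling rate on the encoder sequence length can be checked with a few lines of arithmetic. The sketch below assumes a 10ms frame shift for the input features, which is consistent with the 40ms (4x) and 80ms (8x) time shifts in Table 12 but is not stated explicitly here; the function name is hypothetical.

```python
FRAME_SHIFT_MS = 10  # assumption: hop size of the input acoustic features


def encoder_frames(duration_sec, downsampling):
    """Return (time shift in ms, number of encoder frames) for one utterance."""
    time_shift_ms = FRAME_SHIFT_MS * downsampling
    return time_shift_ms, int(duration_sec * 1000 // time_shift_ms)


shift_4x, frames_4x = encoder_frames(30, downsampling=4)  # encoder-decoder OWSM
shift_8x, frames_8x = encoder_frames(30, downsampling=8)  # OWSM-CTC

print(shift_4x, frames_4x)  # 40 750
print(shift_8x, frames_8x)  # 80 375
# Self-attention cost grows with the square of the sequence length,
# so halving the length cuts the attention score matrix by a factor of 4.
print((frames_4x ** 2) / (frames_8x ** 2))  # 4.0
```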

A.3 Training hyperparameters

Table 13 presents the training hyperparameters of OWSM v3.1 and our OWSM-CTC. Again, we follow the previous OWSM v3.1 (Peng et al., 2024) for a fair comparison, except that we adopt self-conditioned CTC (Nozaki and Komatsu, 2021) at four intermediate layers (see Section 3.2).

Appendix B Small-Scale Ablation Studies

Before the large-scale training using the entire 180k hours of audio data, we conducted preliminary experiments on MuST-C v2 En-De (Cattoni et al., 2021) to investigate the effect of the CNN downsampling rate and the choice of the task for intermediate CTC layers. Specifically, we train 24-layer E-Branchformer-CTC models on the combined ASR and ST data from MuST-C v2 En-De. The input is always English audio, but the output can be the English ASR transcript or its German translation depending on the task specifier (see Figure 2).

B.1 Effect of downsampling strategies

Table 14 compares different downsampling strategies while the other configurations are kept the same. The attention is implemented with FlashAttention (Dao et al., 2022). Self-conditioned CTC is applied at three intermediate layers: 6, 12, and 18. The first two CTC layers always perform ASR, while the others are task-dependent. The results show that using 8x downsampling in the CNN module leads to a slight degradation on WER and BLEU but reduces the GPU memory usage by half. We thus decide to employ 8x downsampling in our large-scale OWSM-CTC, enabling a doubled batch size per GPU. As mentioned in Appendix A.2, with this strategy, we observe a similar training speed compared to the encoder-decoder OWSM model.

B.2 Choice of the CTC task

As discussed in Section 3.2, the intermediate CTC layers can be configured to perform a specific task like ASR or multiple tasks depending on the input task token. Table 15 compares different choices at a small scale using MuST-C v2 En-De. If all CTC layers are task-dependent (i.e., multi-tasking), the model cannot converge when trained from scratch. As more layers are used for ASR only, the ASR WER improves, but the ST BLEU decreases slightly. A good trade-off is to use the first half for ASR only and the second half for multi-tasking. Therefore, in our large-scale OWSM-CTC with 27 layers, we configure the 6th, 12th, and 15th layers to perform ASR only and the other two CTC layers (i.e., 21st and 27th layers) to be multi-tasking. This design also mimics the conventional cascaded system for ST.
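The final layer assignment for the 27-layer model can be summarized as a small lookup. This is only a schematic of the configuration described above, written with hypothetical names; it does not reproduce the ESPnet configuration format or the self-conditioning mechanism itself.

```python
ASR_ONLY_CTC_LAYERS = (6, 12, 15)      # always trained to predict the ASR transcript
TASK_DEPENDENT_CTC_LAYERS = (21, 27)   # follow the task token (27 is the final layer)


def ctc_target_task(layer, requested_task):
    """Return the task a CTC layer is trained on, or None if the layer has no CTC."""
    if layer in ASR_ONLY_CTC_LAYERS:
        return "asr"
    if layer in TASK_DEPENDENT_CTC_LAYERS:
        return requested_task
    return None


# For an ST training sample, early CTC layers still target the source transcript,
# while later layers target the translation, loosely resembling an ASR-then-MT cascade.
print([(layer, ctc_target_task(layer, "st")) for layer in (6, 12, 15, 21, 27)])
```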

Appendix C More Results of ST

Section 4.4 shows the BLEU scores using true case with punctuation. In this section, Table 16 and Table 17 present BLEU in lowercase without punctuation, which is consistent with the setup in prior work (Peng et al., 2024). The findings are very consistent with those in Section 4.4. Our OWSM-CTC achieves higher BLEU scores with faster inference speeds than the encoder-decoder OWSM v3.1 in general.

Appendix D Effect of text prompt

Table 18 presents an example from TEDLIUM, where the text prompt changes the output style. When there is no prompt, the ASR output of OWSM-CTC is in true case with punctuation, and the apostrophes are combined with the previous words. However, when the previous sentence is used as a prompt, the style of the ASR hypothesis becomes more similar to that of the prompt. Specifically, the text is now in lowercase without punctuation marks, and the apostrophes are separate from previous words. This style is closer to the groundtruth transcript.

Although the above example looks promising for biasing the model’s output toward certain directions, we note that this is not guaranteed to work in a zero-shot manner. We have also tried a few examples for zero-shot contextual biasing, where we provide a few biasing words in the prompt (e.g., person names), but we find that the model may not generate the correct word in many cases. This is mainly because the model is not trained to perform this type of task: during training, we only provide the previous sentence (with some probability) as the prompt, which might not be useful at all, so the non-autoregressive model can simply ignore it in most cases. A more practical way to utilize this feature is to fine-tune our pre-trained model using carefully designed data for contextual biasing. We will explore this in the future.

Appendix E Robustness

Table 19 shows that autoregressive decoding sometimes fails to generate the correct output for either ASR or ST, while non-autoregressive decoding is generally more robust to this type of error.