Revisiting Interpolation Augmentation for Speech-to-Text Generation

Chen Xu¹, Jie Wang², Xiaoqian Liu², Qianqian Dong³, Chunliang Zhang^2,4,
Tong Xiao^2,4, **gbo Zhu^2,4, Dapeng Man¹ and Wu Yang¹
¹College of Computer Science and Technology, Harbin Engineering University, Harbin, China
²School of Computer Science and Engineering, Northeastern University, Shenyang, China
³ByteDance
⁴NiuTrans Research, Shenyang, China
{chen.xu, mandapeng, yangwu}@hrbeu.edu.cn
{wangjienlp, liuxiaoqian0319}@outlook.com, [email protected]
{zhangchunliang, xiaotong, zhu**gbo}@mail.neu.edu.cn Corresponding author.

Abstract

Speech-to-text (S2T) generation systems frequently face challenges in low-resource scenarios, primarily due to the lack of extensive labeled datasets. One emerging solution is constructing virtual training samples by interpolating inputs and labels, which has notably enhanced system generalization in other domains. Despite its potential, this technique’s application in S2T tasks has remained under-explored. In this paper, we delve into the utility of interpolation augmentation, guided by several pivotal questions. Our findings reveal that employing an appropriate strategy in interpolation augmentation significantly enhances performance across diverse tasks, architectures, and data scales, offering a promising avenue for more robust S2T systems in resource-constrained settings.¹

¹ The source code is available at https://github.com/xuchennlp/S2T.


1 Introduction

Recently, neural network-based end-to-end systems have achieved impressive improvements and become the de facto modeling method for speech-to-text (S2T) generation tasks, such as automatic speech recognition (ASR) Karita et al. (2019) and automatic speech translation (AST) Xu et al. (2023b). These deep learning models typically comprise millions or even billions of parameters and require vast amounts of training data to achieve state-of-the-art performance Zhang et al. (2022). For example, leading ASR models demand thousands of hours of training data Lu et al. (2020). However, the labeling of such extensive datasets leads to significant costs, and models trained on limited data are prone to overfitting, resulting in suboptimal generalization to unseen samples Ying (2019).

To enhance generalization capabilities, data augmentation has become a key strategy Shorten and Khoshgoftaar (2019). Existing approaches in S2T can be broadly classified into two categories: online and offline augmentation. Online methods, such as SpecAugment Park et al. (2019), enhance regularization by transforming the input representation during training. By introducing random noise into input features, these techniques have become standard in S2T tasks. Offline methods, on the other hand, boost data diversity by creating large amounts of pseudo-data through original audio distortion Ko et al. (2015) or synthesis Rosenberg et al. (2019). Though effective, these offline techniques are separate from the training process, often requiring additional steps and computational resources. This creates a demand for more efficient solutions.

We resort to interpolation augmentation (IPA), also known as Mixup, a notable method first introduced in image classification Zhang et al. (2017). IPA mitigates overfitting by constructing virtual samples through linear interpolation of both input features and labels from two randomly selected samples. This approach has achieved impressive success across diverse domains, including speech processing Medennikov et al. (2018); Lam et al. (2020); Meng et al. (2021); Kang et al. (2023), natural language processing Guo et al. (2019); Sun et al. (2020); Xie et al. (2023), and computer vision Verma et al. (2019); Wang et al. (2023).

In the specialized field of speech processing, preliminary studies have explored IPA in speech separation Lam et al. (2020); Alex et al. (2023) and classification tasks Snyder et al. (2017); Liu et al. (2023). However, its application in S2T tasks remains limited and largely unexplored Medennikov et al. (2018); Meng et al. (2021); Cheng et al. (2022); Zhou et al. (2023). The existing work has not yet established clear guidelines on when and how IPA can be optimally leveraged in S2T tasks, leaving a substantial gap in our understanding and application of this promising technique.

In this paper, we examine this question more closely, conducting a series of experiments to answer the following questions:

Q1: What is the appropriate interpolation strategy, and what distinctions arise between interpolating speech features and text embeddings? (§3)

Q2: How can IPA create an effective combination with existing augmentation techniques, such as the well-established method SpecAugment? (§4)

Q3: Are there specific issues in applying IPA to S2T tasks, and how can they be addressed? (§5)

Q4: How does IPA perform across various scenarios? (§6)

By probing these questions, we develop an effective IPA method that achieves consistent improvements across two S2T tasks (including ASR and AST), various architectures (including encoder-decoder and encoder-CTC), and diverse data scales (ranging from LibriSpeech 10h to 960h).

2 Experimental Settings

Data augmentation methods typically demonstrate greater potential in low-resource scenarios. In light of this, we conduct analyses using the LibriSpeech 100h ASR dataset and subsequently apply our findings to various scenarios. We report results mainly on the test-clean and test-other sets. The average word error rate (WER) is calculated on the concatenation of all four subsets.
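To make the averaging convention concrete, the following minimal sketch contrasts WER computed on concatenated subsets (total errors over total reference words) with a plain mean of per-subset WERs. The helper names `edit_distance` and `corpus_wer` and the toy sentences are illustrative assumptions, not part of the released code or of LibriSpeech.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """WER (%) over a list of (reference, hypothesis) string pairs."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return 100.0 * errors / words


# Toy data, purely illustrative (hypothetical, not system output).
subset_a = [("the cat sat", "the cat sat"), ("a b c d", "a b c")]
subset_b = [("hello world", "hello word")]

wer_a = corpus_wer(subset_a)
wer_b = corpus_wer(subset_b)
wer_concat = corpus_wer(subset_a + subset_b)  # errors and words pooled
wer_mean = (wer_a + wer_b) / 2                # unweighted mean, differs in general
```

Pooling weights each subset by its number of reference words, so the concatenated WER generally differs from the unweighted mean whenever the subsets differ in size.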

Existing data augmentation techniques, such as SpecAugment and speed perturbation, have achieved excellent results. SpecAugment Park et al. (2019), the most widely employed method in S2T tasks, introduces random noise to the input features through time warping, frequency masking, and time masking. Speed perturbation Ko et al. (2015), on the other hand, commonly expands the dataset by generating three variations of the raw audio with speed factors of 0.9, 1.0, and 1.1, facilitating its integration. In our work, the goal of IPA is not only to yield improvements in isolation but also to work orthogonally with these methods. Therefore, we first examine scenarios without other augmentations and then explore the effects of their combination.
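The frequency and time masking operations of SpecAugment can be sketched as below. This is a simplified illustration only: time warping is omitted, a single mask of each kind is applied, and the function name `mask_features` and the maximum widths (27 bins, 40 frames) are assumed values rather than the settings used in our experiments.

```python
import numpy as np


def mask_features(feats, max_freq_width=27, max_time_width=40, rng=None):
    """Zero one random frequency band and one random time span.

    feats: (frames, bins) feature matrix. Returns a masked copy.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = feats.copy()
    frames, bins = out.shape

    # Frequency mask: a band of f_width consecutive bins.
    f_width = rng.integers(0, min(max_freq_width, bins) + 1)
    f_start = rng.integers(0, bins - f_width + 1)
    out[:, f_start:f_start + f_width] = 0.0

    # Time mask: a span of t_width consecutive frames.
    t_width = rng.integers(0, min(max_time_width, frames) + 1)
    t_start = rng.integers(0, frames - t_width + 1)
    out[t_start:t_start + t_width, :] = 0.0
    return out
```

Because the masks are drawn anew for every call, the same utterance is seen with different corruptions across epochs, which is what makes this an online method.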

In the field of S2T, common architectures encompass both encoder-decoder (Enc-Dec) and encoder-CTC (Enc-CTC) designs. The Enc-Dec model consists of an encoder with 12 Conformer layers and a decoder with 6 Transformer layers, each containing 256 hidden units, 4 attention heads, and a feed-forward size of 2048. Connectionist Temporal Classification (CTC, Graves et al., 2006) multi-task learning is applied on top of the encoder, introducing an additional loss with a weight of 0.3. The Enc-CTC model can be viewed as a variant of the Enc-Dec model, containing only an 18-layer Conformer encoder so that the parameter count is comparable, at about 30M. It predicts the text purely through CTC, where the weight of the CTC loss is 1. We initially investigate the effects of IPA on the Enc-Dec model before extending the method to the Enc-CTC model. More details about the datasets and model settings are described in Appendix A.

3 Q1: Choice of Interpolation Strategy

In this section, we begin with an overview of the basic implementation of IPA. Subsequently, we investigate the appropriate interpolation strategy tailored specifically for the field of S2T generation.

3.1 Definition of IPA

IPA, commonly known as Mixup Zhang et al. (2017), constructs virtual samples in a vicinal distribution by linearly interpolating both the inputs and labels of two randomly selected samples, thereby enhancing the model’s generalization capability. Consider two samples $(x_{i},y_{i})$ and $(x_{j},y_{j})$ , where $x$ denotes the input features and $y$ represents the corresponding label. IPA assembles the new sample as follows:

	$\displaystyle x_{m}=\lambda\cdot x_{i}+(1-\lambda)\cdot x_{j}$		(1)
	$\displaystyle y_{m}=\lambda\cdot y_{i}+(1-\lambda)\cdot y_{j}$		(2)

where $\lambda\in[0,1]$ is a weighting factor drawn from a Beta distribution $\lambda\sim\textnormal{Beta}(\alpha,\alpha)$ .

A value of $\alpha$ approaching 0 implies that the generated samples closely resemble either $(x_{i},y_{i})$ or $(x_{j},y_{j})$ , while a value of $\alpha$ approaching $+\infty$ leads to a more balanced interpolation between the two. In practical applications, IPA randomly replaces a subset of samples with the interpolated versions in each mini-batch, while leaving the remaining samples untouched. The selection ratio $\gamma$ is typically set to 1, indicating the model is trained completely on the interpolated samples. Both $\alpha$ and $\gamma$ serve as essential hyper-parameters, and finding their optimal values often requires careful empirical exploration.
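A minimal NumPy sketch of Eqs. (1)-(2) with the two hyper-parameters $\alpha$ and $\gamma$ is given below. The function name `interpolate_batch`, the pairing of samples through a random permutation, and the per-sample Bernoulli selection with probability $\gamma$ are assumptions made for illustration; other ways of choosing the replaced subset are equally compatible with the description above.

```python
import numpy as np


def interpolate_batch(x, y, alpha=2.0, gamma=1.0, rng=None):
    """Apply Eqs. (1)-(2) to roughly a fraction gamma of a mini-batch.

    x: (batch, dim) inputs; y: (batch, classes) one-hot or soft labels.
    """
    rng = np.random.default_rng() if rng is None else rng
    batch = x.shape[0]
    partner = rng.permutation(batch)          # sample j paired with each i
    lam = rng.beta(alpha, alpha, size=batch)  # lambda ~ Beta(alpha, alpha)
    selected = rng.random(batch) < gamma      # samples to be replaced
    lam = np.where(selected, lam, 1.0)        # lambda = 1 keeps the original
    lam = lam[:, None]
    x_m = lam * x + (1.0 - lam) * x[partner]
    y_m = lam * y + (1.0 - lam) * y[partner]
    return x_m, y_m, lam[:, 0]
```

With `gamma=1.0` every sample is interpolated, and with `gamma=0.0` the batch is returned unchanged. A small `alpha` concentrates `lam` near 0 or 1, whereas a large `alpha` concentrates it near 0.5.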

3.2 IPA Strategy in S2T

Building upon the aforementioned framework, we extend our investigation to the application of IPA within the domain of S2T generation, focusing specifically on ASR and AST tasks.

Let a training sample be denoted as $(s,x,y)$ , where $s$ denotes the speech features, $x$ denotes the transcription of $s$ , and $y$ denotes the translation in the target language in AST, or the transcription in the case of ASR. When employing an Enc-Dec model, the training objective combines a joint CTC loss that models $x$ at the encoder level with a cross-entropy (CE) loss that models $y$ within the decoder. The two losses are formulated as:

	$\displaystyle\mathcal{L}_{\textnormal{CTC}}(h,x)=-\log\textnormal{P}_{\textnormal{CTC}}(x\mid h;\theta_{Enc})$		(3)
	$\displaystyle\mathcal{L}_{\textnormal{CE}}(h,z,y)=-\log\textnormal{P}_{\textnormal{CE}}(y\mid h,z;\theta)$		(4)

where $h$ is the output of the encoder, and $z$ is the input embedding of the decoder. $\theta_{Enc}$ and $\theta$ are the model parameters of the encoder and the whole network. Two hyper-parameters $w_{\textnormal{CTC}}$ and $w_{\textnormal{CE}}$ are introduced to balance CTC and CE loss components:

	$\displaystyle\mathcal{L}=w_{\textnormal{CTC}}\cdot\mathcal{L}_{\textnormal{CTC}}+w_{\textnormal{CE}}\cdot\mathcal{L}_{\textnormal{CE}}$		(5)

To apply the IPA in S2T tasks, several significant distinctions must be noted when compared to conventional classification tasks:

• In typical classification models, the architecture usually comprises only the encoder, whereas in the S2T model based on the Enc-Dec architecture, the decoder processes the embedding sequence as input. The feasibility of directly interpolating word embeddings remains an open question.

• The label in classification tasks often takes the form of a one-hot category, thereby simplifying the interpolation process, while the S2T tasks present a more complex scenario. Specifically, the training objectives for CTC and CE are discrete text sequences, and the method to interpolate and learn the label effectively remains an open question.

To address these challenges, we first design the interpolation strategy grounded in previous studies, followed by an exploration of specific issues. Consider two arbitrary samples in a batch, denoted as $(s_{i},x_{i},y_{i})$ and $(s_{j},x_{j},y_{j})$ . We interpolate the input according to Eq. (1):

	$\displaystyle s_{m}=\lambda\cdot s_{i}+(1-\lambda)\cdot s_{j}$		(6)

where we pad the shorter feature sequence with zeros so that both have the same length for interpolation. After obtaining the representation $h_{m}$ output by the encoder, we calculate the CTC loss with respect to both labels and interpolate the two losses as follows:

	$\displaystyle\mathcal{L}_{\textnormal{CTC}}(h_{m},x_{i},x_{j})=\lambda\cdot\mathcal{L}_{\textnormal{CTC}}(h_{m},x_{i})+(1-\lambda)\cdot\mathcal{L}_{\textnormal{CTC}}(h_{m},x_{j})$		(7)
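The zero-padding of Eq. (6) and the loss interpolation of Eq. (7) can be sketched as follows. The names `interpolate_speech` and `interpolated_loss` are hypothetical, and `loss_fn` stands in for any CTC implementation that maps an encoder output and a label sequence to a scalar; the encoder itself is omitted.

```python
import numpy as np


def interpolate_speech(s_i, s_j, lam):
    """Eq. (6): zero-pad the shorter (frames, dim) matrix, then mix."""
    frames = max(s_i.shape[0], s_j.shape[0])

    def pad(s):
        # Append all-zero frames at the end; np.pad defaults to zeros.
        return np.pad(s, ((0, frames - s.shape[0]), (0, 0)))

    return lam * pad(s_i) + (1.0 - lam) * pad(s_j)


def interpolated_loss(loss_fn, h_m, x_i, x_j, lam):
    """Eq. (7): weight the losses against both label sequences by lambda."""
    return lam * loss_fn(h_m, x_i) + (1.0 - lam) * loss_fn(h_m, x_j)
```

Note that the labels themselves are never mixed: the same encoder output $h_{m}$ is scored against each original transcription, and only the resulting scalar losses are combined.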

Employing the widely proven interpolation strategy Zhang et al. (2017) in the encoder is natural because the encoder resembles the models used in classification. We therefore focus on the interpolation strategy within the decoder. A straightforward implementation mirrors the operation in the encoder and interpolates the embeddings $z_{i}$ and $z_{j}$ in the input layer of the decoder:

	$\displaystyle z_{m}=\lambda\cdot z_{i}+(1-\lambda)\cdot z_{j}$		(8)

Next, we calculate losses with two labels $y_{i}$ and $y_{j}$ for interpolation. The whole procedure is formalized as:

	$\displaystyle\mathcal{L}_{\textnormal{CE}}(h_{m},z_{m},y_{i},y_{j})=\lambda\cdot\mathcal{L}_{\textnormal{CE}}(h_{m},z_{m},y_{i})+(1-\lambda)\cdot\mathcal{L}_{\textnormal{CE}}(h_{m},z_{m},y_{j})$		(9)

For simplicity, we refer to this strategy as embedding interpolation (EIP).

However, the preceding approach may lead to a disparity between training and decoding. During training, the decoder takes the interpolated embedding sequence as input, whereas it receives only a single embedding sequence during inference. To bridge this gap, we investigate an alternative strategy that solely interpolates the encoder input while preserving the original input in the decoder Meng et al. (2021). The loss in this context is calculated as follows:

	$\displaystyle\mathcal{L}_{\textnormal{CE}}(h_{m},z_{i},z_{j},y_{i},y_{j})=\lambda\cdot\mathcal{L}_{\textnormal{CE}}(h_{m},z_{i},y_{i})+(1-\lambda)\cdot\mathcal{L}_{\textnormal{CE}}(h_{m},z_{j},y_{j})$		(10)

In summary, the two interpolation strategies have distinct characteristics. The first approach leverages simple interpolation operations akin to those in the encoder and contributes to the regularization of the decoder. Conversely, the second approach keeps the decoder input consistent between training and inference by bypassing interpolation in the decoder, and may facilitate more stable learning.
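The difference between the two decoder-side strategies, Eqs. (8)-(9) versus Eq. (10), can be sketched as below. The function names are hypothetical, `ce_loss(h, z, y)` stands in for a decoder forward pass followed by cross-entropy, and the embeddings `z_i` and `z_j` are assumed to be already padded to a common shape.

```python
import numpy as np


def ce_with_eip(ce_loss, h_m, z_i, z_j, y_i, y_j, lam):
    """Eqs. (8)-(9): mix the decoder input embeddings, then mix the losses."""
    z_m = lam * z_i + (1.0 - lam) * z_j
    return lam * ce_loss(h_m, z_m, y_i) + (1.0 - lam) * ce_loss(h_m, z_m, y_j)


def ce_without_eip(ce_loss, h_m, z_i, z_j, y_i, y_j, lam):
    """Eq. (10): each loss term uses its own, unmixed decoder input."""
    return lam * ce_loss(h_m, z_i, y_i) + (1.0 - lam) * ce_loss(h_m, z_j, y_j)
```

In the EIP variant both loss terms share the decoder input `z_m`, whereas the variant of Eq. (10) evaluates the decoder on `z_i` and `z_j` separately, each conditioned on the same mixed encoder output `h_m`.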

Table 1: WER of the IPA method applied to the Enc-Dec model without SpecAugment on the LibriSpeech 100h dataset.

Method	$\alpha$	$\gamma$	EIP	clean	other	Avg.
Baseline	-	-	-	11.85	30.78	20.81
IPA	0.2	1.0	-	10.31	25.12	17.37
IPA	2.0	0.3	-	10.31	26.44	18.00
IPA	2.0	1.0	-	10.14	22.45	15.99
IPA	0.2	1.0	$\surd$	10.29	25.53	17.48
IPA	2.0	0.3	$\surd$	10.40	26.67	18.02
IPA	2.0	1.0	$\surd$	9.91	22.90	16.35

Method	$\alpha$	$\gamma$	EIP	clean	other	Avg.
Baseline	-	-	-	8.51	19.05	13.63
IPA	0.2	0.3	-	8.45	18.68	13.46
IPA	0.2	1.0	-	8.75	19.51	13.89
IPA	2.0	0.3	-	9.19	19.88	14.27
IPA	2.0	1.0	-	11.01	23.48	16.88
IPA	0.2	0.3	$\surd$	8.29	18.97	13.53
IPA	0.2	1.0	$\surd$	8.73	19.39	13.80
IPA	2.0	0.3	$\surd$	8.71	19.01	13.65
IPA	2.0	1.0	$\surd$	10.36	20.24	15.07

Method	$\alpha$	$\gamma$	EIP	clean	other	Avg.
Baseline	-	-	-	8.51	19.05	13.63
AIPA	0.2	0.3	-	8.13	18.95	13.36
AIPA	0.2	1.0	-	8.01	18.52	13.05
AIPA	2.0	0.3	-	8.26	18.39	13.16
AIPA	2.0	1.0	-	8.48	18.91	13.48
AIPA	0.2	0.3	$\surd$	8.45	18.72	13.25
AIPA	0.2	1.0	$\surd$	7.91	18.14	12.79
AIPA	2.0	0.3	$\surd$	8.30	18.57	13.27
AIPA	2.0	1.0	$\surd$	7.88	18.17	12.95

	$\displaystyle\mathcal{L}_{\textnormal{CTC}}^{\textnormal{COS}}(h_{m},h)=-\sum_{m=1}^{T}\sum_{k=1}^{|V|}\textnormal{P}(\pi_{m}=v^{k}\mid h)\times\log\textnormal{P}(\pi_{m}=v^{k}\mid h_{m})$		(12)

	$\displaystyle\mathcal{L}=w_{\textnormal{CTC}}\cdot\mathcal{L}_{\textnormal{CTC}}+w_{\textnormal{CTC}}^{\textnormal{COS}}\cdot\mathcal{L}_{\textnormal{CTC}}^{\textnormal{COS}}+w_{\textnormal{CE}}\cdot\mathcal{L}_{\textnormal{CE}}+w_{\textnormal{CE}}^{\textnormal{COS}}\cdot\mathcal{L}_{\textnormal{CE}}^{\textnormal{COS}}$		(14)

	$\displaystyle\mathcal{L}_{\textnormal{CTC}}^{\textnormal{COS}}(h_{m},h_{i},h_{j})=\lambda\cdot\mathcal{L}_{\textnormal{CTC}}^{\textnormal{COS}}(h_{m},h_{i})+(1-\lambda)\cdot\mathcal{L}_{\textnormal{CTC}}^{\textnormal{COS}}(h_{m},h_{j})$		(13)

Model	dev-clean	dev-other	test-clean	test-other	Avg.
Baseline	8.20	19.13	8.51	19.05	13.63
AIPA	7.56	17.95	7.91	18.14	12.95
+ CTC COS	7.11	17.66	7.49	17.85	12.43
+ CTC COS^∗	7.20	17.74	7.57	17.90	12.51
+ CE COS	7.41	17.92	7.82	17.99	12.69
+ Both COS	7.26	17.75	7.61	17.80	12.52

Dataset	Method	dev-clean	dev-other	test-clean	test-other	Avg.
10h	Baseline	35.34	51.89	35.13	53.20	43.74
10h	AIPA	28.34	43.89	28.46	44.76	36.22
50h	Baseline	13.10	28.40	13.48	29.46	21.03
50h	AIPA	10.54	22.50	10.84	23.12	16.64
960h	Baseline	3.47	9.34	3.61	9.02	6.31
960h	AIPA	2.91	7.61	3.01	7.51	5.21

Method	Transformer	Conformer
Baseline	6.06	7.16
+ InterCTC	5.67	5.87
+ PAE	5.32	5.81
AIPA	5.58	6.14
+ CTC COS	5.12	4.55
+ InterCTC	5.15	4.53
+ InterCTC COS	5.05	4.35
+ PAE	4.62	4.27

Method	dev	tst-COMMON
Baseline	25.42	26.31
+ InterCTC	26.35	26.56
+ PAE	26.62	26.62
AIPA	25.85	26.38
+ CTC COS	26.04	26.75
+ CE COS	26.13	26.64
+ Both COS	26.79	26.88
+ InterCTC	26.48	26.68
+ All COS	26.92	27.50
+ PAE	26.69	27.39

Method	dev-clean	dev-other	test-clean	test-other	Avg.
Baseline	9.58	23.07	9.99	23.84	16.50
+ InterCTC	8.18	20.19	8.47	20.73	14.28
+ PAE	8.09	19.85	8.32	20.76	14.15
AIPA	8.77	21.41	9.07	21.75	15.14
+ CTC COS	7.16	17.93	7.39	18.17	12.57
+ InterCTC	7.74	19.73	8.12	20.09	13.82
+ CTC COS	7.03	17.43	7.37	17.80	12.31
+ Both COS	6.73	17.07	6.99	17.35	11.94
+ PAE	6.44	16.49	6.70	16.67	11.49