How to Learn in a Noisy World?
Self-Correcting the Real-World Data Noise on Machine Translation

Yan Meng  Di Wu  Christof Monz
Language Technology Lab
University of Amsterdam
{y.meng, d.wu, c.monz}@uva.nl
Abstract

The massive amounts of web-mined parallel data contain large amounts of noise. Semantic misalignment, as the primary source of the noise, poses a challenge for training machine translation systems. In this paper, we first study the impact of real-world hard-to-detect misalignment noise by proposing a process to simulate the realistic misalignment controlled by semantic similarity. After quantitatively analyzing the impact of simulated misalignment on machine translation, we show the limited effectiveness of widely used pre-filters to improve the translation performance, underscoring the necessity of more fine-grained ways to handle data noise. By observing the increasing reliability of the model’s self-knowledge for distinguishing misaligned and clean data at the token-level, we propose a self-correction approach which leverages the model’s prediction distribution to revise the training supervision from the ground-truth data over training time. Through comprehensive experiments, we show that our self-correction method not only improves translation performance in the presence of simulated misalignment noise but also proves effective for real-world noisy web-mined datasets across eight translation tasks.



1 Introduction

The success of machine translation (MT) models is mainly due to the availability of large amounts of web-crawled data. However, publicly available web-mined parallel corpora (bitext) such as CCAligned (El-Kishky et al., 2020), WikiMatrix (Schwenk and Douze, 2017), ParaCrawl (Bañón et al., 2020) and NLLB (NLLB Team et al., 2022) have been shown to be noisy (Kreutzer et al., 2022; Ranathunga et al., 2024). The notable performance drop of NMT models when training with injected synthetic noise (Khayrallah and Koehn, 2018) or fine-tuning on CCAligned (Lee et al., 2022) indicates the importance of developing noise-robust training methods when learning from noisy corpora.

Figure 1: An illustration of our self-correction method. When the model’s translation is superior to the human reference, e.g., albergo means “hotel” instead of “house”, we self-correct the ground-truth data with the model’s prediction and learn towards the new revised target.

Given a noisy training dataset, a common and straightforward approach to mitigate the impact of noisy data is to filter out low-quality training samples (Herold et al., 2022; Bane et al., 2022). However, in practice, large amounts of misaligned samples still exist in pre-filtered web-mined datasets (Kreutzer et al., 2022). Earlier works (Khayrallah and Koehn, 2018; Herold et al., 2022; Li et al., 2024) study the impact of misalignment on machine translation, but they generate synthetic misaligned sentences by random shuffling, which is unrealistic. Real-world misaligned sentences contain partially shared meanings, which disguise them as seemingly parallel data and make them hard to detect. To analyze such hard-to-detect real-world misalignment, we design a process to simulate it, controlled by semantic similarity. Similar to the real-world setting, we demonstrate that our simulated misalignment also challenges widely used pre-filters, including Laser (Artetxe and Schwenk, 2018), Comet (Rei et al., 2020) and Bi-Cleaner (Zaragoza-Bernabeu et al., 2022).

Under our simulated noise setting, we evaluate an alternative type of approach to handle data noise: data truncation (Kang and Hashimoto, 2020; Li et al., 2024; Flores and Cohan, 2024), which focuses on the model training dynamics and ignores token-level losses when there is a relatively large inconsistency between the model’s prediction and the ground truth during training. However, we observe that truncation methods are sensitive to varying levels of simulated misalignment noise. For example, in a low-resource training corpus with a high misalignment rate, truncation methods degrade the translation performance (shown in Section 5.3). We argue that the noisy low-resource setting restricts the model from acquiring sufficient correct knowledge, resulting in the inaccurate removal of clean and useful ground-truth data. Moreover, truncation methods start to ignore potential data noise early in training, which overlooks the increasing reliability of the model’s prediction over time.

To overcome these limitations, we propose an approach called self-correction, which leverages the model’s predictions to self-correct noise during training while maintaining supervision from the ground truth to avoid discarding useful training information (as shown in Figure 1). To adapt to the changing reliability of the model’s self-knowledge, we set a dynamic schedule that gradually increases the trust in the model’s predictions when revising the training supervision from the ground truth. In the early stage of training, the model is not well trained, so we place greater trust in the reference data than in the model’s prediction. As the model gains knowledge during training, we use it to revise the ground-truth data progressively.

We evaluate our self-correction method in both simulated and real-world environments. We demonstrate that our method consistently outperforms baselines in both high- and low-resource datasets with different levels of misalignment noise. Moreover, we clearly show the gains are mainly from revising the misaligned samples while maintaining the performance of clean parallel data. In the real-world setting, our self-correction method effectively handles naturally occurring noise in web-mined parallel datasets, e.g. ParaCrawl and CCAligned, leading to performance gains of up to 1.2 BLEU points across eight translation tasks and outperforming other alternative methods, including pre-filters and truncation.

2 Background

2.1 The Noisy World

Web-mined parallel corpora are the primary training data source for machine translation models. However, parallel data crawled from public websites lack quality guarantees and contain different types of noise (Kreutzer et al., 2022), including wrong language, non-linguistic content, and misalignment noise.

The primary source of noise in parallel web-mined data is semantic misalignment (Khayrallah and Koehn, 2018; Kreutzer et al., 2022; Ranathunga et al., 2024). For instance, Khayrallah and Koehn (2018) analyzed the data quality of the raw ParaCrawl corpus, showing that 77% of the analyzed sentence pairs contain noise while half of them are misalignments. Wrong-language and non-linguistic content account for only a small portion, since filters such as language identification toolkits (Herold et al., 2022) can easily detect them. Kreutzer et al. (2022) extended the data quality analysis to pre-filtered web-mined datasets, e.g., WikiMatrix and CCAligned, noting that more than 50% of the data in both corpora is noisy, with misalignment as the primary noise type. Even in the top-quality parallel data from the NLLB corpus, many misaligned sentences can still be found (Ranathunga et al., 2024).

Overall, previous studies demonstrate the prevalence of noisy training data in web-mined corpora for machine translation and underscore the importance of noise-robust training, particularly in handling misaligned data.

2.2 Learning in the Noisy World

2.2.1 Pre-Filter

Data filtering is a straightforward way to remove noise from the translation training corpus and mitigate its negative impact. There are two types of methods to ensure the semantic alignment between the source and target in a sentence pair: (1) surface-level filters, e.g., removing sentence pairs whose source and target lengths differ substantially; (2) semantic-level filters, which rely on quality estimation models to score each sentence pair (Kepler et al., 2019; Rei et al., 2020; Peter et al., 2023). Other studies treat denoising as an outlier detection problem (Taghipour et al., 2011) or a ranking problem (Cui et al., 2013).

In this paper, we mainly consider semantic-level filters: (1) Laser (Artetxe and Schwenk, 2018), a sentence alignment pre-filter tool for web-mined corpora, i.e., CCAligned and WikiMatrix; (2) Bi-cleaner (Zaragoza-Bernabeu et al., 2022), a classifier that predicts if a sentence is a translation of another. It is also a part of the pre-processing pipeline for generating the ParaCrawl dataset; (3) Comet (Rei et al., 2020), a widely-used quality estimation model for machine translation.

All these approaches rely on external models to select high-quality data before training. However, our approach does not require such pre-selection but focuses on model training dynamics, which can be applied more broadly.

2.2.2 Truncation

There are two ways for removing data noise by focusing on the model’s training dynamics: (1) using extrinsic trusted data to measure the data quality between noisy and denoised NMT models (Wang et al., 2018); (2) solely using the model’s self-knowledge to remove potential noise during training (Kang and Hashimoto, 2020; Li et al., 2024). In our approach, we do not require any external trusted data or models and thus we only compare with the second type of work, i.e., data truncation.

The most common type of truncation is based on the loss, which estimates the data quality by the model’s predicted probability of the ground truth (Kang and Hashimoto, 2020; Goyal et al., 2022; Flores and Cohan, 2024). Ground-truth tokens with high loss are considered as noise and will be skipped during training by setting their loss to zero. However, relying on loss to remove noise overlooks the model’s probability distribution of non-target tokens. In a high entropy context, the probability of the correct ground-truth token is also low due to many alternative continuations, which cannot be treated as noise. To consider the entire model prediction distribution, Li et al. (2024) further propose Error Norm Truncation and use the $l_2$ norm between the model’s prediction distribution and the one-hot distribution of the ground-truth token to measure the data quality. During training, tokens with error norm values exceeding a pre-defined threshold will be removed.

In this paper, we compare against data truncation methods that rely on two model-based metrics, loss and error norm values. However, truncation methods are sensitive to when truncation starts and how much data is truncated. Our approach overcomes these limitations by revising all the ground truth with an adaptive trust weight on the model’s prediction. Details are introduced in Section 4.

3 An Empirical Study of Misalignment

In this section, we investigate the primary source of noise, i.e., semantic misalignment, in a simulated setting. We first introduce a strategy to simulate realistic misalignment noise by controlling semantic similarity (Section 3.1). Next, we show the similarity of our simulated noise to real-world misalignment in terms of adequacy and its hard-to-detect nature (Section 3.2). Under our simulated noisy setting, we evaluate the model-based metrics to distinguish data noise and show the potential limitations (Section 3.3).

| Language | Sentence |
|---|---|
| en | Alcohol poisoning is the biggest cause of death. |
| nl | Jacht is de belangriekste doodsoorzaak. |
| nl (English translation) | *Hunting is the biggest cause of death.* |
| en | With Bravofly you can compare the flight prices Santa Cruz De La Palma of over 400 of the most famous airlines in the world. |
| de | Bravofly findet für Sie sämtliche Billigflüge Zürich - Santa Cruz De La Palma der besten europäischen Billigfluggesellschaften. |
| de (English translation) | *Bravofly finds all the cheap flights Zurich - Santa Cruz De La Palma from the best European low-cost airlines for you.* |

Table 1: Examples of misaligned sentences in the ParaCrawl dataset. Bold represents the misaligned meanings. Italic text represents the English translation.

3.1 Simulating Misalignment Noise

To simulate misalignment, previous works (Bane et al., 2022; Herold et al., 2022; Li et al., 2024) randomly shuffle the target sentences of a clean parallel training corpus. However, random shuffling noise can be easily removed by pre-filters based on length difference or semantic similarity (Herold et al., 2022), so it oversimplifies the misalignment found in real-world web-mined corpora. For instance, Kreutzer et al. (2022) show that a large portion of misaligned sentence pairs that share partial semantics still exists even after applying pre-filter tools such as Laser (examples in Table 1), which shows the challenge of detecting real-world misalignment.

To quantitatively analyze the impact of realistic misalignment noise, we design a process to simulate real-world misalignment controlled by semantic similarity (see Algorithm 1 in the Appendix). From a chunk of clean data, we first choose $k$ misaligned target candidates for a source sentence based on length difference and word overlap from the true parallel target. Next, we use quality estimation methods, e.g., Laser or Comet, to calculate the semantic similarity scores between the source sentence and its misaligned candidates. The candidate with the top score is selected as the final synthetic misaligned target. In this way, the misalignment noise is generated with controlled semantic overlaps. Examples that are controlled by Laser (Misaligned-Laser) and Comet (Misaligned-Comet) are in Appendix 6.
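
The candidate-selection process described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the Jaccard word overlap, the `max_len_diff` cutoff, the default `k`, and the function names are all hypothetical choices, and `similarity` is a placeholder for a sentence-pair scorer such as Laser or Comet.

```python
from typing import Callable, Sequence


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two sentences (assumed measure)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def pick_misaligned_target(
    source: str,
    true_target: str,
    pool: Sequence[str],
    similarity: Callable[[str, str], float],
    k: int = 5,
    max_len_diff: int = 3,
) -> str:
    """Return the pool sentence that looks most parallel to `source`
    without being its true translation."""
    true_len = len(true_target.split())
    # Step 1: keep candidates whose length is close to the true target.
    close = [
        t for t in pool
        if t != true_target and abs(len(t.split()) - true_len) <= max_len_diff
    ]
    if not close:
        raise ValueError("no candidate satisfies the length constraint")
    # Step 2: keep the k candidates sharing the most words with the true target.
    candidates = sorted(
        close, key=lambda t: word_overlap(t, true_target), reverse=True
    )[:k]
    # Step 3: the candidate with the top semantic score becomes the noisy target.
    return max(candidates, key=lambda t: similarity(source, t))
```

Because the final choice maximizes the scorer's own similarity score, the resulting pairs are by construction difficult for that scorer to reject, which is the property the simulation aims for.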

3.2 Real-World Misalignment

Figure 2: Adequacy (scale: 1–5) scores on simulated and real-world misaligned sentences. The real-world misaligned sentences are selected from ParaCrawl V7.0.
Figure 3: loss (above) and el2n (below) distribution for clean and misaligned-Laser noise samples during the training process (Epoch = 5, 10, 15, 30). Red distribution represents misaligned-Laser noise and blue distribution represents the clean data.

3.2.1 Adequacy

To show the similarity of our simulated noise to real-world misalignment, we conduct a human evaluation on 50 simulated and real-world misaligned sentences, rating their Adequacy (scale 1–5), which measures the meaning overlap between source and target. In Figure 2, we show that both the real-world misalignment and the simulated Misaligned-Laser/Comet noise have a relatively high adequacy score, above 2.5, while randomly shuffled misaligned sentences only have an adequacy of 1.5. This confirms that our simulated misalignment contains partial semantic overlaps, as real-world misalignment does. Moreover, our simulation process ensures the fluency of misaligned targets by selecting sentences from a clean corpus. Details of the human evaluation are in Appendix B.3.

3.2.2 Hard-to-Detect Nature

To show that our simulated noise is as hard to detect as real-world misalignment, we investigate the noise detection ability of widely used pre-filters: Comet, Laser, and Bi-Cleaner.

We calculate the noise detection accuracy of the data filters on a mixed set with equal amounts of clean and noisy data. For the clean data, we randomly sample 2,000 clean sentence pairs from the WMT2017 De→En training set. For Misaligned-Random, we randomly shuffle the order of target sentences in the sampled clean sentence pairs. For Misaligned-Comet and Misaligned-Laser, we use the same source sentences from the sampled clean data, and select the misaligned targets from another 200K target sentences in the training corpus based on Algorithm 1. We score each sentence pair with the filter models and set the threshold according to the true ratio of clean to noisy sentence pairs, here 1:1. Sentence pairs with scores below this threshold are classified as noisy. The details for scoring sentence pairs by filter models are shown in Appendix A.
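
The ratio-based thresholding above can be illustrated with a short sketch. It assumes only that each sentence pair has a scalar filter score where higher means cleaner; the function name `filter_accuracy` is hypothetical.

```python
from typing import Sequence


def filter_accuracy(scores: Sequence[float], is_noisy: Sequence[bool]) -> float:
    """Accuracy of a filter that flags the lowest-scoring pairs as noisy.

    The threshold follows the true noise ratio: if n pairs are noisy,
    the n lowest-scoring pairs are predicted to be noisy.
    """
    if len(scores) != len(is_noisy):
        raise ValueError("scores and labels must have the same length")
    n_noisy = sum(is_noisy)
    # Indices sorted from the lowest to the highest filter score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    predicted_noisy = set(order[:n_noisy])
    correct = sum(
        (i in predicted_noisy) == is_noisy[i] for i in range(len(scores))
    )
    return correct / len(scores)
```

With a balanced 1:1 set, a scorer that carries no information about noise would be expected to reach about 50% under this protocol, which is a useful reference point when reading the accuracies reported for each filter.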

| Type | Comet (filter accuracy) | Laser (filter accuracy) | Bi-Cleaner (filter accuracy) |
|---|---|---|---|
| Misaligned-Random | 73% | 76% | 75% |
| Misaligned-Laser | 52% | 40% | 52% |
| Misaligned-Comet | 50% | 60% | 52% |

Table 2: Accuracy of data filters when distinguishing different misaligned noise from clean parallel data.

Table 2 shows the filtering accuracy of data filters on different types of misalignment noise. First, Misaligned-Random is detected most accurately by all data filters, particularly by Laser, with an accuracy of 76%. This calls into question previous assumptions (Khayrallah and Koehn, 2018; Li et al., 2024) about the impact of misalignment noise on translation performance, since most randomly shuffled pairs can be pre-filtered. However, our introduced noise types, i.e., Misaligned-Laser and Misaligned-Comet, pose a challenge to all pre-filters, as real-world misalignment does. Overall, we show the validity of our simulated noise in two aspects: (1) Adequacy, reflected in a level of shared semantics similar to that of real-world misalignment; (2) Hard-to-Detect Nature, reflected in the low accuracy of widely used pre-filters.

3.3 Fine-grained Misalignment Detection

Model-based metrics, including loss and error norm values, are used in data truncation methods (Kang and Hashimoto, 2020; Li et al., 2024) to detect and ignore data noise at the token-level during training. Here, we evaluate their effectiveness under our simulated noisy settings.

Loss measures the model’s predicted probability of the ground-truth token, while the error norm value (el2n) calculates the difference between the ground-truth (one-hot) distribution $\mathit{OH}(y_t)$ and the model’s prediction distribution $p_{\theta}(\cdot \mid y_{<t}, x)$ (Eq. 1).

$$\mathit{el2n} = \lVert p_{\theta}(\cdot \mid x, y_{<t}) - \mathit{OH}(y_t) \rVert_2. \tag{1}$$

Tokens with higher loss or el2n values are treated as noise, since they indicate a larger inconsistency between the model’s prediction and the ground truth.
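
To make the difference between the two metrics concrete, the following sketch computes both for a single token. The toy five-word vocabulary and the two probability vectors are illustrative assumptions, not data from the paper.

```python
import math
from typing import Sequence, Tuple


def loss_and_el2n(probs: Sequence[float], target: int) -> Tuple[float, float]:
    """Token-level negative log-likelihood and error L2 norm (Eq. 1)."""
    loss = -math.log(probs[target])
    # Squared distance between the prediction and the one-hot target.
    squared = sum(
        (p - (1.0 if i == target else 0.0)) ** 2 for i, p in enumerate(probs)
    )
    return loss, math.sqrt(squared)


# Two predictions that give the ground-truth token (index 0) the same probability.
spread = [0.2, 0.2, 0.2, 0.2, 0.2]  # high entropy: many plausible continuations
peaked = [0.2, 0.8, 0.0, 0.0, 0.0]  # confident in a different token

loss_spread, el2n_spread = loss_and_el2n(spread, target=0)
loss_peaked, el2n_peaked = loss_and_el2n(peaked, target=0)
```

Both cases yield the same loss because the ground-truth token receives probability 0.2 in each, but el2n is larger for the peaked prediction. This matches the motivation for using the whole prediction distribution: a low target probability in a high-entropy context is weaker evidence of noise than a confident prediction of a different token.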

To evaluate whether the above model-based metrics can distinguish our simulated noise from clean data, we record the loss and el2n values for each token of the 2,000 clean and Misaligned-Laser sentence pairs in the same data setting of Section 3.2.2. Figure 3 shows the changes of loss and el2n value distribution for the clean and noisy tokens in different training epochs. First, we find both metrics can gradually distinguish clean and noisy data as the training time increases. This indicates the effectiveness of the model’s self-knowledge for distinguishing hard-to-detect data noise. Second, we find that the el2n metric shows better distinguishability than loss, indicating the importance of considering the whole model prediction distribution.

However, we point out two limitations of truncation methods relying on model-based metrics. First, they overlook the increasing reliability of the model’s prediction by removing potential data noise from an early training time. Second, they cannot avoid removing clean and useful data: as shown in Figure 3, the noisy and clean distributions still overlap in a region where the two cannot be distinguished.

4 Noise Self-Correction

To overcome the limitations of truncation methods in Section 3.3, we go a step further and propose the self-correction method to gradually increase the trust of model prediction distributions to correct the ground-truth data during training. Overall, our method keeps the supervision signals from the training data to avoid clean training information loss and also progressively trusts a dynamic low entropy state of model prediction distribution to revise the data. Our work is in line with the label correction in computer vision (Wang et al., 2022; Lu and He, 2022).

New Target.

Consider conditional probability models $p_{\theta}(y \mid x)$ for machine translation. Such models assign a probability to a target sequence $y = (y_1, \ldots, y_T)$ by factorizing its log-probability into the sum of the log probabilities of the individual tokens $y_i$ from the vocabulary $V$. At each training iteration, the model learns towards the ground-truth token distribution, the one-hot $q(y_i)$, with a model prediction distribution $p_{\theta}(\cdot \mid x, y_{<i})$. In self-correction, we leverage the model prediction $p_{\theta}(\cdot \mid y_{<i}, x)$ to revise the one-hot distribution $q(y_i)$ with the aim of learning towards a new target $\bar{q}(y_i)$:

$$\bar{q}(y_i) = (1 - \lambda)\, q(y_i) + \lambda\, p_{\theta}(\cdot \mid x, y_{<i}) \tag{2}$$

In this way, the new target $\bar{q}(y_i)$ retains the original supervision signal from the training data and also incorporates the model’s prediction. $\lambda$ denotes a weighting factor that determines how much to trust the model prediction.

Dynamic Learning Schedule.

We set $\lambda$ as a function of a learning time function $\text{Time}(t)$ of the training iteration $t$ and the model entropy $H(p_{\theta})$:

$$\lambda = (1 - H(p_{\theta})) \times \text{Time}(t) \tag{3}$$

Through $H(p_{\theta})$, the model trusts its own prediction more when the prediction is more confident, i.e., has lower entropy. $\text{Time}(t)$ allows the model to progressively trust its self-knowledge as training goes on. We use a schedule (Bengio et al., 2015) to increase $\text{Time}(t)$ as a function of the training iteration $t$, where $T$ is the total number of iterations:

$$\text{Time}(t) = \frac{1}{1 + \exp\left(\beta\left(\frac{t}{T} + \alpha\right)\right)} \tag{4}$$

where $\alpha$ and $\beta$ are hyper-parameters (footnote 1: we set $\alpha = -0.6$ and $\beta = -6$ to control the range of $\text{Time}(t) \in (0, 1)$).

In general, at the beginning of training, the model is not well-trained, and a small $\text{Time}(t)$ makes the model rely more on the ground-truth data than on its own prediction. As training progresses, the increasing $\text{Time}(t)$ allows the model to place more trust in its own increasingly reliable prediction.
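
The schedule in Equations 3 and 4 can be traced numerically with the sketch below. One point is an assumption made here rather than something stated in this section: the entropy is divided by the log of the vocabulary size so that $H(p_{\theta})$, and therefore $\lambda$, stays in $[0, 1]$. The function names are hypothetical.

```python
import math
from typing import Sequence


def time_schedule(t: int, total: int, alpha: float = -0.6, beta: float = -6.0) -> float:
    """Eq. 4: sigmoid-shaped trust schedule over training progress t / total."""
    return 1.0 / (1.0 + math.exp(beta * (t / total + alpha)))


def normalized_entropy(probs: Sequence[float]) -> float:
    """Entropy divided by log(vocabulary size), so the result lies in [0, 1].

    The normalization is an assumption made for this sketch.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def trust_weight(probs: Sequence[float], t: int, total: int) -> float:
    """Eq. 3: lambda = (1 - H(p)) * Time(t)."""
    return (1.0 - normalized_entropy(probs)) * time_schedule(t, total)
```

With $\alpha = -0.6$ and $\beta = -6$, `time_schedule` rises from roughly 0.03 at the first iteration to roughly 0.92 at the last, crossing 0.5 when 60% of training is complete. A uniform prediction yields $\lambda = 0$ at any point in training, so under this sketch the ground truth is fully trusted whenever the model is maximally uncertain.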

Sharpen the Model Prediction.

To overcome the overly uncertain model prediction when learning towards the new target in Equation 2, we sharpen the model prediction distribution by controlling the softmax temperature $\tau$ in $\bar{p}_{\theta} = \frac{\exp(z_i/\tau)}{\sum_{j=1}^{N} \exp(z_j/\tau)}$. We control $\tau$ in two ways: a fixed value, and a dynamic value that varies inversely with $\text{Time}(t)$. In the dynamic case, $\tau$ gradually decreases as training progresses: a higher $\tau$ at the early stage of training can prevent the model from converging and a smaller $\tau$ in the later stage makes the model more confident in its output. In Section 5, we compare the performance of both fixed (footnote 2: we use a fixed $\tau = 0.5$, following Wang et al. (2022)) and dynamic $\tau$ to self-correct the data noise, and also show the impact of varying $\tau$ in Section 5.2.3.

Training.

After acquiring a new target $\bar{q}(y_i)$, derived from both the true data and the peakier model prediction distributions, we obtain a new training objective based on maximum likelihood estimation (MLE). The following loss function is minimized for every training token over the training corpus $D$:

$$L_{\theta}(x, y) = \mathbb{E}_{y_i \sim D}\left[-\bar{q}(y_i) \log p_{\theta}(\cdot \mid y_{<i}, x)\right] \tag{5}$$

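
The pieces of this section (sharpened prediction, revised target, and training loss) can be combined into a single-token sketch. It is a plain-Python illustration under stated assumptions rather than the fairseq implementation: $\lambda$ is passed in directly, the sharpened distribution is used only inside the target, and in a real training loop the target would presumably be treated as a constant with no gradient flowing through it.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], tau: float = 1.0) -> List[float]:
    """Softmax with temperature tau; tau < 1 sharpens the distribution."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def self_correction_loss(
    logits: Sequence[float], target: int, lam: float, tau: float = 0.5
) -> float:
    """Cross-entropy against the revised target of Eq. 2 for one token."""
    p = softmax(logits)             # model prediction used inside the log
    p_sharp = softmax(logits, tau)  # sharpened prediction used in the target
    loss = 0.0
    for i, (p_i, s_i) in enumerate(zip(p, p_sharp)):
        one_hot = 1.0 if i == target else 0.0
        q_bar = (1.0 - lam) * one_hot + lam * s_i  # Eq. 2
        loss -= q_bar * math.log(p_i)              # Eq. 5, summed over the vocabulary
    return loss
```

Setting `lam=0.0` recovers the standard negative log-likelihood of the ground-truth token. With `lam > 0`, a token on which the model confidently disagrees with the reference contributes a smaller loss, so a possibly misaligned reference token pulls the model less strongly while still providing some supervision.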

| Method | Variant | Misaligned-Laser 10% | Misaligned-Laser 30% | Misaligned-Laser 50% | Misaligned-Comet 10% | Misaligned-Comet 30% | Misaligned-Comet 50% | Raw-Crawl 10% | Raw-Crawl 30% | Raw-Crawl 50% |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline with noise | | 33.0 | 31.7 | 30.5 | 33.1 | 32.0 | 30.0 | 33.0 | 31.5 | 29.6 |
| Oracle w/o noise | | 33.3 | 32.7 | 32.0 | 33.3 | 32.7 | 32.0 | 33.3 | 32.7 | 32.0 |
| Pre-Filter | Laser | 33.2 | 31.4 | 30.0 | 33.1 | 32.6 | 30.2 | 33.0 | 31.8 | 30.0 |
| Pre-Filter | Comet | 32.9 | 31.5 | 30.4 | 33.0 | 31.7 | 29.6 | 32.4 | 31.6 | 28.5 |
| Truncation | loss-frac | 33.1 | 31.4 | 30.7 | 33.0 | 31.2 | 29.8 | 33.0 | 31.8 | 29.9 |
| Truncation | el2n-frac | 33.0 | 31.9 | 31.0 | 32.0 | 31.8 | 29.0 | 33.0 | 31.6 | 30.0 |
| Truncation | el2n-threshold | 33.3 | 32.7 | 31.1 | 33.3 | 32.3 | 29.8 | 33.0 | 31.7 | 30.2 |
| Self-Correction (Ours) | fixed $\tau = 0.5$ | 33.1 | 32.9* | 31.3* | 33.2 | 32.4* | 30.4* | 33.4* | 31.7* | 30.3* |
| Self-Correction (Ours) | dynamic $\tau$ | 33.5* | 32.3* | 31.4* | 33.3* | 32.5* | 30.6* | 33.5* | 31.9* | 30.4* |

Table 3: BLEU scores of the high-resource De→En translation task with different types of noise. The BLEU score of the full clean training corpus (5.8M) De→En is 33.5. Baseline with noise: the translation performance when injecting 10%, 30%, 50% of data noise. Oracle w/o noise: the upper-bound translation performance when training with the remaining clean data, specifically 90%, 70%, 50% of the data excluding the noise. Bold and Underline represent the best and second best score. * signifies that our self-correction method is significantly better (p-value < 0.05) than the baseline. Statistical significance is computed with paired bootstrap resampling, following Koehn (2004). Chrf++ and Comet scores are provided in Table 9.

| Method | Variant | Misaligned-Laser 10% | Misaligned-Laser 30% | Misaligned-Laser 50% | Misaligned-Comet 10% | Misaligned-Comet 30% | Misaligned-Comet 50% | Raw-Crawl 10% | Raw-Crawl 30% | Raw-Crawl 50% |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline with noise | | 22.3 | 20.0 | 18.0 | 21.4 | 18.7 | 14.2 | 22.3 | 21.0 | 19.0 |
| Oracle w/o noise | | 22.3 | 21.0 | 20.8 | 22.3 | 21.0 | 20.8 | 22.3 | 21.0 | 20.8 |
| Pre-Filter | Laser | 22.0 | 18.7 | 17.0 | 21.1 | 18.9 | 16.3 | 21.0 | 21.2 | 19.2 |
| Pre-Filter | Comet | 22.0 | 20.0 | 17.6 | 21.0 | 18.6 | 13.8 | 22.2 | 20.9 | 18.9 |
| Truncation | loss-frac | 22.1 | 20.5 | 17.9 | 20.0 | 17.2 | 14.2 | 22.2 | 21.1 | 19.1 |
| Truncation | el2n-frac | 22.0 | 20.5 | 18.2 | 21.1 | 18.9 | 14.3 | 22.0 | 21.3 | 19.2 |
| Truncation | el2n-threshold | 22.0 | 20.6 | 19.4 | 22.0 | 18.9 | 14.5 | 22.3 | 22.0 | 19.1 |
| Self-Correction (Ours) | fixed $\tau = 0.5$ | 22.4 | 21.2* | 19.8* | 21.7* | 19.0* | 15.3* | 22.5* | 21.5* | 19.9* |
| Self-Correction (Ours) | dynamic $\tau$ | 22.3 | 20.3* | 20.2* | 22.1* | 19.6* | 16.2* | 22.3* | 21.9* | 19.6* |

Table 4: BLEU scores of the low-resource En→Si translation task with different types of noise. The BLEU score of the full clean training corpus (0.9M) En→Si is 22.5. Chrf++ and Comet scores are provided in Table 10.

5 Experiments

In this section, we investigate the effectiveness of our self-correction method for translation in two experiment settings: (1) simulated noise (Section 5.2), and (2) real-world noise (Section 5.3). For simulated noise, we conduct experiments by injecting two types of noise, raw-crawl data and simulated misaligned noise, into clean translation data. For the real-world noise, we perform experiments on web-mined datasets, i.e., ParaCrawl and CCAligned, across eight language pairs.

5.1 Comparing Systems

We compare our self-correction method with the following systems. Note that all model details are kept in line with the corresponding baselines. For pre-filter methods, the training data size is determined by the number of sentence pairs that are filtered out. Data truncation and our self-correction method both use the full training corpus.

Pre-Filtering.

We select two widely-used data filters: Laser and Comet. We rank the training sentence pairs based on the scores calculated by the filter models. For the simulated noise experiments (Section 5.2), we filter out the sentence pairs with the lowest scores before training, matching the size to the injected data noise. For the real-world noise experiments (Section 5.3), we filter out 20% of the sentence pairs with the lowest scores.

Truncation.

We compare two truncation methods: (1) loss truncation (Kang and Hashimoto, 2020), (2) error norm value (el2n) truncation (Li et al., 2024). Following Li et al. (2024), we choose the best result among three truncation fractions {0.05, 0.1, 0.2} for both loss-fraction and el2n-fraction truncation. We select the best result among three threshold values {1.3, 1.35, 1.4} for el2n-threshold truncation (footnote 3: following Li et al. (2024), truncation starts at training iteration 1500).
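
The two truncation variants can be sketched as keep-masks over per-token metric values. This is an illustrative simplification: the fraction is computed over whatever list of tokens is passed in, which is an assumption of this sketch, and the function names are hypothetical.

```python
from typing import List, Sequence


def truncate_by_fraction(values: Sequence[float], fraction: float) -> List[bool]:
    """Keep-mask that drops the `fraction` of tokens with the highest metric value."""
    n_drop = int(len(values) * fraction)
    # Indices of the n_drop largest values (highest loss or el2n).
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    worst = set(ranked[:n_drop])
    return [i not in worst for i in range(len(values))]


def truncate_by_threshold(values: Sequence[float], threshold: float) -> List[bool]:
    """Keep-mask that drops every token whose metric value exceeds `threshold`."""
    return [v <= threshold for v in values]


def apply_truncation(losses: Sequence[float], keep: Sequence[bool]) -> List[float]:
    """Set the loss of truncated tokens to zero so they are skipped in training."""
    return [loss if kept else 0.0 for loss, kept in zip(losses, keep)]
```

Since el2n is the distance between two probability distributions, one of them one-hot, it cannot exceed $\sqrt{2} \approx 1.414$. Threshold values between 1.3 and 1.4 therefore only remove tokens for which the model places nearly all of its probability mass on a token other than the reference.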

5.2 Simulated Noisy World

5.2.1 Experiment Setup

We conduct experiments on both high- and low-resource translation tasks. We use the WMT2017 De→En (German to English) news translation data as the high-resource task and En→Si (English to Sinhala) from OPUS (https://opus.nlpl.eu/) as the low-resource task. Following Herold et al. (2022), we inject noise by replacing a portion (10%, 30%, 50%) of the clean training corpus with simulated misalignment noise or raw crawl data. The misalignment noise is generated by Algorithm 1 from the replaced portion of the clean corpus. For each translation task, we randomly select the raw crawl data from the raw ParaCrawl corpus (https://paracrawl.eu/). The raw data contains a mixture of naturally occurring noise, including misaligned sentences, wrong language, grammar errors, etc., and provides a realistic testbed for noise-handling methods.
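
The replacement-based noise injection can be sketched as follows. This is a minimal illustration that assumes the noisy pairs have already been prepared (for example, sampled raw crawl pairs); the function name `inject_noise`, the rounding of the replaced count, and the fixed seed are assumptions of this sketch.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def inject_noise(
    clean: Sequence[Pair], noise: Sequence[Pair], rate: float, seed: int = 0
) -> List[Pair]:
    """Replace a `rate` share of the clean pairs with noisy pairs.

    The corpus size stays fixed, so a 30% rate leaves 70% clean data.
    """
    n_replace = round(len(clean) * rate)
    if n_replace > len(noise):
        raise ValueError("not enough noisy pairs for the requested rate")
    rng = random.Random(seed)
    # Positions in the clean corpus that will be overwritten.
    replaced = set(rng.sample(range(len(clean)), n_replace))
    noisy_iter = iter(rng.sample(list(noise), n_replace))
    return [
        next(noisy_iter) if i in replaced else pair
        for i, pair in enumerate(clean)
    ]
```

Keeping the corpus size constant means that a higher noise rate also reduces the amount of clean data, which is why the oracle setting trains on only the remaining 90%, 70%, or 50% of clean pairs.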

Our translation model uses the fairseq (Ott et al., 2019) implementation of the Transformer-Big architecture for the high-resource task and Transformer-Base for the low-resource task. The full training details are shown in Appendix C.

Figure 4: Left and Middle: BLEU scores on misaligned-Laser (Left) and clean (Middle) data from the baseline and self-correction models trained on the De→En task with 30% injected misaligned-Laser noise. Right: BLEU scores of the self-correction models on the De→En task with 30% injected noise of different types and varying $\tau$.

5.2.2 Results

Tables 3 and 4 show the high-resource De→En and the low-resource En→Si translation performance when training on corpora with misalignment noise or raw crawl data. Overall, the baseline model’s BLEU scores gradually drop with increasing levels of both simulated and real-world noise.

When applied under the simulated noisy settings, pre-filter methods achieve modest gains for both simulated misalignment and raw crawl data. However, compared with the upper-bound oracle results, where the model is trained only on the remaining clean data, pre-filters fall notably short, especially in the highly noisy (50%) setting. This performance gap underscores the difficulty pre-filters have in accurately distinguishing simulated misalignment noise from clean data, in line with the low filter accuracy for misalignment noise reported in Table 2.

For truncation methods, el2n truncation shows better performance gains than loss truncation, indicating the effectiveness of using the whole model prediction distribution to estimate data quality rather than the sole prediction probability of the ground-truth token. However, with an increasing level of noise, el2n truncation can only achieve minimal or even no improvement, showing its limitation in distinguishing clean and noisy tokens in a highly noisy training environment. In this case, removing tokens with larger el2n values leads to a loss of clean and important training data.

In contrast, our self-correction method consistently improves both low- and high-resource translation performance under the different controlled simulated noise settings and outperforms the other systems in most scenarios. Even with 50% injected noise, e.g., misaligned-Laser, our self-correction method still improves by over 2 BLEU points over the low-resource baseline and by nearly 1 BLEU point over the high-resource baseline. This shows the necessity of keeping the human reference information in data-sensitive scenarios, and that the idea of “revising” instead of “removing” can achieve this goal. Moreover, instead of relying solely on flawed human references, our method demonstrates the effectiveness of using the system-generated translation. This finding supports the hypothesis that training models exclusively towards the reference has limited effectiveness, especially when the reference is inferior to the model-generated translation (Xu et al., 2024).

5.2.3 Analysis

The Impact of Self-Correction on Clean vs. Noisy Data.

Here, we analyze whether the self-correction method revises the noisy samples while maintaining performance on clean parallel data. We sample 1K clean and misaligned-Laser sentence pairs from the De→En training corpus and record their BLEU scores at different training epochs for the baseline and self-correction models. For the misaligned-Laser pairs, we calculate the BLEU scores against their true parallel references. Figure 4 (Left) shows that the self-correction model corrects the misalignment noise, achieving higher BLEU scores than the baseline model. Meanwhile, the self-correction model matches the baseline model on clean parallel data, without hurting the clean training information, as shown in Figure 4 (Middle). This further explains the advantage of the self-correction method over pre-filtering and data truncation: it avoids discarding clean and useful training data.

The Impact of Sharpening Model Prediction.

Here, we analyze how sharpening the model prediction distribution used to correct the ground truth affects translation performance. We train the self-correction models on the De→En task with 30% of different types of noise and varying values of the softmax temperature $\tau$. Figure 4 (Right) shows that sharpening the model prediction distribution with a smaller $\tau$ achieves better translation performance in all noisy settings. However, the optimal $\tau$ varies with the type of noise, which makes it difficult to select a single fixed $\tau$ for different scenarios. This motivates us to design a dynamic $\tau$, which varies automatically within a low range of the entropy state over training time. The overall performance in both Section 5.2 and Section 5.3 with a dynamic $\tau$ also shows its general applicability to different noise scenarios.
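As an illustration of what sharpening means here, the sketch below applies a temperature-scaled softmax to a toy logit vector and measures the entropy of the result. It assumes the usual formulation in which logits are divided by the temperature before the softmax; the logit values are hypothetical, and the sketch does not reproduce how the sharpened distribution is combined with the ground truth or how the dynamic schedule is computed.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; tau < 1 sharpens, tau > 1 flattens."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical decoder logits for one position
plain = softmax(logits, tau=1.0)
sharp = softmax(logits, tau=0.5)   # the fixed setting reported in Table 5
sharper = softmax(logits, tau=0.1)

entropies = [entropy(plain), entropy(sharp), entropy(sharper)]
```

Lowering the temperature concentrates probability mass on the top prediction without changing which token is ranked first, so the entropy of the distribution decreases as the temperature shrinks.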

| System | Method | en→fr♡ | en→ru♡ | en→tr† | en→es† | en→be† | en→ht† | en→si♡ | en→km♡ | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Misaligned rate (%) | | 10 | 19 | 44 | 22 | 10 | 35 | 62 | 18 | - |
| Corpus size (M) | | 5 | 5 | 5 | 5 | 1.1 | 0.55 | 0.21 | 0.06 | - |
| Baseline | | 41.1 | 26.5 | 23.5 | 21.6 | 9.9 | 17.0 | 7.0 | 4.2 | 18.8 |
| Pre-Filter | Laser | 41.8 | 26.8 | 23.2 | 22.5 | 9.8 | 17.1 | 6.6 | 3.8 | 19.0 |
| Pre-Filter | Comet | 41.6 | 26.4 | 23.7 | 22.2 | 9.6 | 16.9 | 6.8 | 4.0 | 18.9 |
| Truncation | loss-frac | 41.2 | 26.0 | 23.8 | 21.9 | 9.8 | 16.9 | 6.0 | 4.0 | 18.7 |
| Truncation | el2n-frac | 41.3 | 26.1 | 23.9 | 22.0 | 10.0 | 16.8 | 6.0 | 4.5 | 18.8 |
| Truncation | el2n-threshold | 41.2 | 26.3 | 24.1 | 22.2 | 10.2 | 17.1 | 6.2 | 4.2 | 18.9 |
| Self-Correction (Ours) | fixed $\tau = 0.5$ | 41.9* | 26.6 | 23.4 | 21.9* | 10.1* | 17.2* | 7.6* | 4.6* | 19.2* |
| Self-Correction (Ours) | dynamic $\tau$ | 42.3* | 26.7* | 24.2* | 22.8* | 10.5* | 17.4* | 7.8* | 5.0* | 19.6* |

Table 5: BLEU scores on real-world web-mined corpora. Bold and underline represent the best and second-best scores. † denotes language pairs from CCAligned V1.0. ♡ denotes language pairs from ParaCrawl V7.1. * signifies that our self-correction method is significantly better (p-value < 0.05) than the baseline. The misaligned noise rates for the different language pairs are reported from Kreutzer et al. (2022). ChrF++ and Comet scores are provided in Table 11.

5.3 Real Noisy World

5.3.1 Experiment Setup

We investigate two noisy web-crawled datasets: Paracrawl V7.1 and CCAligned V1.0. These two datasets exhibit varying levels of semantic misalignment across different language pairs (Kreutzer et al., 2022). For each dataset, we select four language pairs, from high- to low-resource, with varying misaligned noise rates. For Paracrawl, the language pairs are en→fr (French), en→ru (Russian), en→si (Sinhala), and en→km (Khmer). For CCAligned, the language pairs are en→tr (Turkish), en→es (Spanish), en→be (Belarusian), and en→ht (Haitian). For the high-resource language pairs (en→fr, en→ru, en→tr, en→es), we randomly sample 5M sentence pairs as the training corpus. For medium- and low-resource language pairs, we use the original corpus size. The validation and test sets for all tasks are from Flores101 (https://github.com/facebookresearch/flores). We train the Transformer-Big (Vaswani et al., 2017) architecture for all tasks.

5.3.2 Results

Table 5 shows the translation performance for two noisy web-crawled datasets, CCAligned V1.0 and Paracrawl V7.1, across eight language pairs with varying corpus size and misaligned rates.

Overall, our self-correction method consistently outperforms the alternative methods, including pre-filtering and data truncation, with up to 1.2 BLEU and 2.0 ChrF++ points improvement over the baseline. In line with the findings of the simulated noise experiments (Section 5.2), pre-filter and truncation methods achieve performance gains on high-resource tasks but are limited on low-resource tasks with high misalignment rates, e.g., en→ht, en→si and en→km, where neither pre-filtering nor truncation helps and both can even hurt translation performance. This further emphasizes the importance of keeping ground-truth supervision for real-world noisy low-resource tasks. Our self-correction approach shows superior performance especially on low-resource tasks with high misaligned rates, indicating its effectiveness in handling real-world data noise in different scenarios. Beyond data pruning, our method provides new insight into handling text noise by correcting it, particularly for low-resource tasks.

6 Conclusion

In this paper, we aim to address the data quality issues in the web-mined parallel corpus by proposing a noise-handling self-correction approach for machine translation tasks. Our approach focuses on the translation model’s training dynamics and revises the training supervision from the ground-truth data by the model’s prediction distribution.

To evaluate our approach in different noisy settings, we take a deep look at the primary noise source in web-mined parallel data, i.e., semantic misalignment. To quantitatively analyze the impact of real-world misalignment, we propose a process to simulate the misalignment controlled by semantic similarity. Under our simulated noisy training environment, we demonstrate that our self-correction method consistently improves the baselines with different levels of misalignment noise. Moreover, we also show that our self-correction method can effectively handle naturally occurring noise in the real-world noisy web-mined datasets across eight translation tasks, outperforming other alternative approaches, including pre-filter and data truncation. Our work also provides a critical finding on the effectiveness of leveraging the system-generated translation instead of solely relying on flawed reference data.

7 Limitation

First, our work aims at learning from noisy training corpora; improvements on high-quality training datasets might therefore be limited. Second, our work focuses on “training from scratch” for translation models, while future work can apply our method to more NLP tasks and settings, including the pre-training or fine-tuning stage for text generation. Third, we focus only on the primary source of noise in machine translation, i.e., misaligned sentences. Even though we show the effectiveness of our approach on real-world raw crawl data, which contains naturally occurring data noise, future work can evaluate our approach in a more fine-grained way on other types of noise, e.g., wrong language or misordered sentences.

Acknowledgments

This research was funded in part by the Netherlands Organization for Scientific Research (NWO) under project number VI.C.192.080. We would like to thank Vlad Niculae, David Stap, Sergey Troshin and Evgeniia Tokarchuk for their useful feedback and discussion.

References

  • Artetxe and Schwenk (2018) Mikel Artetxe and Holger Schwenk. 2018. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.
  • Bane et al. (2022) Fred Bane, Celia Soler Uguet, Wiktor Stribiżew, and Anna Zaretskaya. 2022. A comparison of data filtering methods for neural machine translation. In Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track), pages 313–325, Orlando, USA. Association for Machine Translation in the Americas.
  • Bañón et al. (2020) Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, and Jaume Zaragoza. 2020. ParaCrawl: Web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4555–4567, Online. Association for Computational Linguistics.
  • Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
  • Cui et al. (2013) Lei Cui, Dongdong Zhang, Shujie Liu, Mu Li, and Ming Zhou. 2013. Bilingual data cleaning for SMT using graph-based random walk. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 340–345, Sofia, Bulgaria. Association for Computational Linguistics.
  • El-Kishky et al. (2020) Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, and Philipp Koehn. 2020. CCAligned: A massive collection of cross-lingual web-document pairs. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5960–5969, Online. Association for Computational Linguistics.
  • Flores and Cohan (2024) Lorenzo Jaime Flores and Arman Cohan. 2024. On the benefits of fine-grained loss truncation: A case study on factuality in summarization. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 138–150, St. Julian’s, Malta. Association for Computational Linguistics.
  • Goyal et al. (2022) Tanya Goyal, Jiacheng Xu, Junyi Jessy Li, and Greg Durrett. 2022. Training dynamics for text summarization models. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 2061–2073. Association for Computational Linguistics.
  • Herold et al. (2022) Christian Herold, Jan Rosendahl, Joris Vanvinckenroye, and Hermann Ney. 2022. Detecting various types of noise for neural machine translation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2542–2551, Dublin, Ireland. Association for Computational Linguistics.
  • Kang and Hashimoto (2020) Daniel Kang and Tatsunori B. Hashimoto. 2020. Improved natural language generation via loss truncation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 718–731, Online. Association for Computational Linguistics.
  • Kepler et al. (2019) Fabio Kepler, Jonay Trénous, Marcos Treviso, Miguel Vera, and André F. T. Martins. 2019. OpenKiwi: An open source framework for quality estimation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 117–122, Florence, Italy. Association for Computational Linguistics.
  • Khayrallah and Koehn (2018) Huda Khayrallah and Philipp Koehn. 2018. On the impact of various types of noise on neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 74–83, Melbourne, Australia. Association for Computational Linguistics.
  • Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA.
  • Koehn (2004) Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.
  • Kreutzer et al. (2022) Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, and Mofetoluwa Adeyemi. 2022. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72.
  • Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
  • Lee et al. (2022) En-Shiun Lee, Sarubi Thillainathan, Shravan Nayak, Surangika Ranathunga, David Adelani, Ruisi Su, and Arya McCarthy. 2022. Pre-trained multilingual sequence-to-sequence models: A hope for low-resource language translation? In Findings of the Association for Computational Linguistics: ACL 2022, pages 58–67, Dublin, Ireland. Association for Computational Linguistics.
  • Li et al. (2024) Tianjian Li, Haoran Xu, Philipp Koehn, Daniel Khashabi, and Kenton Murray. 2024. Error norm truncation: Robust training in the presence of data noise for text generation models. In The Twelfth International Conference on Learning Representations.
  • Lu and He (2022) Yangdi Lu and Wenbo He. 2022. Selc: Self-ensemble label correction improves learning with noisy labels. In International Joint Conference on Artificial Intelligence.
  • Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Peter et al. (2023) Jan-Thorsten Peter, David Vilar, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, and Markus Freitag. 2023. There’s no data like better data: Using QE metrics for MT data filtering. In Proceedings of the Eighth Conference on Machine Translation, pages 561–577, Singapore. Association for Computational Linguistics.
  • Ranathunga et al. (2024) Surangika Ranathunga, Nisansa De Silva, Velayuthan Menan, Aloka Fernando, and Charitha Rathnayake. 2024. Quality does matter: A detailed look at the quality and utility of web-mined parallel corpora. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 860–880, St. Julian’s, Malta. Association for Computational Linguistics.
  • Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
  • Schwenk and Douze (2017) Holger Schwenk and Matthijs Douze. 2017. Learning joint multilingual sentence representations with neural machine translation. In Rep4NLP@ACL.
  • Taghipour et al. (2011) Kaveh Taghipour, Shahram Khadivi, and Jia Xu. 2011. Parallel corpus refinement as an outlier detection algorithm. In Proceedings of Machine Translation Summit XIII: Papers, Xiamen, China.
  • team et al. (2022) NLLB Team, Marta Ruiz Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Alison Youngblood, Bapi Akula, Loïc Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon L. Spruit, C. Tran, Pierre Yves Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation. ArXiv, abs/2207.04672.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6000–6010, Red Hook, NY, USA. Curran Associates Inc.
  • Wang et al. (2018) Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, and Ciprian Chelba. 2018. Denoising neural machine translation training with trusted data and online data selection. In Conference on Machine Translation.
  • Wang et al. (2022) Xinshao Wang, Yang Hua, Elyor Kodirov, Sankha Subhra Mukherjee, David A. Clifton, and Neil Martin Robertson. 2022. Proselflc: Progressive self label correction towards a low-temperature entropy state. bioRxiv.
  • Xu et al. (2024) Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. 2024. Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation. Preprint, arXiv:2401.08417.
  • Zaragoza-Bernabeu et al. (2022) Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Marta Bañón, and Sergio Ortiz Rojas. 2022. Bicleaner AI: Bicleaner goes neural. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 824–831, Marseille, France. European Language Resources Association.

Appendix A Data Filters

Comet is a neural framework for training machine translation evaluation models that can function as metrics (Rei et al., 2020). The framework uses cross-lingual pre-trained language modeling that exploits information from both the source input and the target reference to predict translation quality. We use the reference-free wmt-20-qe-da Comet model as the data filter to score each sentence pair in the training corpus.

Laser scores sentence pairs based on cross-lingual sentence embeddings. To calculate the Laser score for each sentence pair, we generate cross-lingual sentence embeddings using the pre-trained model from Artetxe and Schwenk (2018). The underlying system is trained as a multilingual translation system with a multi-layer bidirectional LSTM encoder and an LSTM decoder, without information about the input language on the encoder. The output vectors of the encoder are compressed into a single fixed-length embedding using max-pooling, which is the cross-lingual sentence embedding produced by the Laser model. The assumption is that two sentences with the same meaning but from different languages will be mapped onto the same embedding vector. We calculate the cosine similarity between the source and target sentence embeddings as the Laser score. The higher the Laser score, the more semantically similar the source and target sentences are.
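The two operations named above, max-pooling over encoder outputs and cosine similarity between the pooled vectors, can be illustrated with a few lines of plain Python. The encoder states below are made-up three-dimensional toy values standing in for real Laser encoder outputs, and the helper names are hypothetical.

```python
import math


def max_pool(encoder_states):
    """Element-wise max over time steps, giving one fixed-length vector."""
    return [max(column) for column in zip(*encoder_states)]


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical encoder outputs: one row per token, three dimensions each.
src_states = [[0.2, 0.9, -0.1], [0.7, 0.1, 0.3]]
tgt_states = [[0.6, 0.8, 0.2], [0.1, 0.2, 0.3], [0.5, 0.9, -0.4]]

src_emb = max_pool(src_states)
tgt_emb = max_pool(tgt_states)
laser_like_score = cosine_similarity(src_emb, tgt_emb)
```

Max-pooling makes the embedding length independent of sentence length, which is what allows a source and a target with different token counts to be compared directly.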

Bi-Cleaner is a Python tool that aims at detecting noisy sentence pairs in a parallel corpus. It indicates the likelihood of a pair of sentences being mutual translations. Sentence pairs considered high-quality are scored near 1, and those considered noisy are scored near 0. We use the multilingual model bitextor/bicleaner-ai-full-en-xx from HuggingFace (https://huggingface.co/bitextor/bicleaner-ai-full-en-xx) as the pre-filter for all language tasks.

Appendix B Controlled Generated Misaligned Noise

B.1 Algorithm

Algorithm 1 shows the method to generate misaligned noise, controlled by two steps: (1) surface-level features control by word overlap and sentence length; (2) quality control by Laser or Comet.

Algorithm 1 Misaligned Noise Generation

Input: A chunk of parallel and de-duplicate clean data $D$ with $N$ sentence pairs, source and target $(S, T)$; a threshold $k$ for selecting misaligned candidates; a quality controlled model $M \in \{\text{Laser}, \text{Comet}\}$.
Output: Misaligned data $\bar{D}$ with $N$ sentence pairs, source and misaligned target $(S, \bar{T})$.

for each source sentence $s_i$ in $S$ do
     Step 1: Surface-level Features Control
     Initialize a list $L$ of misaligned candidates for $s_i$
     for each target sentence $t_j$ ($j \neq i$) in $T$ do
          if $\operatorname{len}(L) < k$ then
               if $|\operatorname{len}(t_j) - \operatorname{len}(s_i)| < 3$ and $\text{word overlap ratio}(t_j, t_i) > 0.4$ then
                   Append $t_j$ to list $L$
               end if
          end if
     end for
     Step 2: Quality Control
     Initialize a quality score list $Q$
     for each candidate $t_n$ in $L$ do
          $\operatorname{score}(s_i, t_n) = M(s_i, t_n)$
          Append $\operatorname{score}$ to list $Q$
     end for
     Select $t_k$ from $L$ with the highest score in $Q$
     Append the pair $(s_i, t_k)$ to the misaligned data $\bar{D}$
end for
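A runnable sketch of Algorithm 1 is given below under several stated assumptions: sentence length is counted in whitespace tokens, the word overlap ratio is taken to be the Jaccard overlap of token sets, and sources with no surviving candidate are skipped (the algorithm does not say what happens in that case). `toy_score` is a hypothetical stand-in for the Laser or Comet scorer, and all names are illustrative.

```python
def word_overlap_ratio(a, b):
    """Jaccard overlap between token sets (an assumed definition)."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def generate_misaligned(sources, targets, score_fn, k,
                        max_len_diff=3, min_overlap=0.4):
    """Pair each source with a similar-looking but wrong target sentence."""
    misaligned = []
    for i, src in enumerate(sources):
        # Step 1: surface-level control (length and overlap with true target).
        candidates = []
        for j, tgt in enumerate(targets):
            if len(candidates) >= k:
                break
            if j == i:
                continue
            close_length = abs(len(tgt.split()) - len(src.split())) < max_len_diff
            if close_length and word_overlap_ratio(tgt, targets[i]) > min_overlap:
                candidates.append(tgt)
        if not candidates:
            continue  # assumption: skip sources without any candidate
        # Step 2: quality control, keep the candidate the scorer likes best.
        best = max(candidates, key=lambda tgt: score_fn(src, tgt))
        misaligned.append((src, best))
    return misaligned


def toy_score(src, tgt):
    """Hypothetical scorer: number of tokens shared by source and target."""
    return len(set(src.split()) & set(tgt.split()))


sources = ["buchen Sie Ihre Unterkunft in Edinburgh !",
           "buchen Sie Ihre Unterkunft in Amsterdam !",
           "Brüssel , 17 März 2015",
           "buchen Sie Ihre Unterkunft in Edinburgh Zentrum !"]
targets = ["book your accommodation in Edinburgh today !",
           "book your accommodation in Amsterdam today !",
           "Brussels , 17 March 2015",
           "book your accommodation in Edinburgh centre !"]

noisy_pairs = generate_misaligned(sources, targets, toy_score, k=5)
```

With a real scorer, Step 2 is what makes the noise hard to detect: among the surface-similar candidates, the one that the filter model itself rates highest is chosen as the misaligned target.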

B.2 Misaligned Noise Samples

Table 6 shows simulated misaligned samples of misaligned-Laser and misaligned-Comet. Overall, the simulated misaligned noise controlled by external models shares a certain amount of semantic meaning with the true reference.

src: der Rat kam überein, dass die Kommission die Anwendung dieser Verordnung mit dem Ziel überwacht, etwaige Probleme möglichst schnell festzustellen und zu regeln.
ref: the Council agreed that the Commission will keep under review the implementation of this Regulation with a view to detecting and addressing any difficulties as soon as possible.
Mis-Laser: the Commission has therefore acted wisely in exploring every possible avenue to guard against any difficulties and to prepare for any eventualities.

src: Brüssel , 17 März 2015
ref: Brussels , 17 March 2015
Mis-Laser: Brussels , 4 May 2011

src: wann möchten Sie im Aeolos Hotel übernachten?
ref: when would you like to stay at the Aeolos Hotel?
Mis-Laser: when would you like to stay at the Leenane Hotel?

src: buchen Sie Ihre Unterkunft in Edinburgh today!
ref: book your accommodation in Edinburgh today!
Mis-Comet: book your accommodation in Amsterdam today!

src: wir akzeptieren folgende Kreditkarten:Visa, Maestro, Master Card, American Express, JBC, Dinners Club.
ref: We accept the following credit cards: Visa, Maestro, Master Card, American Express, JBC, Dinners Club.
Mis-Comet: we accept payments by credit card (Visa, MasterCard, Diners Club), Paypal or transfer.

src: Puchacz Puchacz Spa befindet sich in Niechorze , in einer schönen und malerischen Umgebung , ist lediglich 150m vom Meer entfernt und liegt in der Nähe des Liwia Łuża Sees .
ref: Puchacz Puchacz Spa is located in Niechorze, in a beautiful and picturesque setting, only 150m from the sea and close to Lake Liwia Łuża.
Mis-Comet: the Country Hotel Sa Talaia, surrounded by beautiful gardens is located close to San Antonio city and not far away from the historic city of Ibiza

Table 6: Simulated Misaligned Sentences Samples

B.3 Human Evaluation of Misaligned Noise

The aim of this data annotation is to evaluate the adequacy of real-world and simulated misaligned noise. Given a source and a misaligned target, the annotator selects how much of the meaning overlaps between them, as shown in Table 7. The simulated misaligned sentence pairs are constructed from the clean corpus WMT2017 De→En, and the real-world misaligned sentences are selected from the web-mined Paracrawl datasets. The language pair is German into English, and we therefore select human annotators who are proficient in both German and English.

Questionnaire
Whether this target translation conveys the same meanings as the source sentence?
○ all meanings  ○ most meanings  ○ much meanings  ○ little meanings  ○ no meanings

Table 7: Questionnaire for human evaluation, where ○ indicates single-item selection. From all meanings to no meanings, the adequacy score scales from 5 to 1.

| Translation Task | Training Source | Dev Set | Test Set |
|---|---|---|---|
| De→En | WMT2017 (5.8M) | NewsTest2016 | NewsTest2017 |
| En→Si | OPUS (0.9M) | OPUS | OPUS |

Table 8: The clean training corpus and evaluation dataset details for the simulated noise experiments.

Appendix C Training Details

C.1 Training and Evaluation

We follow the setup of the Transformer-base and Transformer-big models (Vaswani et al., 2017). For each model, the number of layers in the encoder and in the decoder is $N = 6$. For Transformer-base, we employ $h = 8$ parallel attention layers, or heads. The dimensionality of input and output is $d_{\text{model}} = 512$, and the inner layer of the feed-forward networks has dimensionality $d_{\text{ff}} = 2048$. For Transformer-big, we employ $h = 16$ parallel attention layers, or heads. The dimensionality of input and output is $d_{\text{model}} = 1024$, and the inner layer of the feed-forward networks has dimensionality $d_{\text{ff}} = 4096$.

All models are trained with the Adam optimizer (Kingma and Ba, 2015) for up to 500K steps for high-resource tasks and 100K steps for low-resource tasks, with a learning rate of 5e-4 and an inverse square root scheduler. A dropout rate of 0.3 and label smoothing of 0.2 are used. Each model is trained on one NVIDIA A6000 GPU with a batch size of 25K tokens. We choose the best checkpoint according to the average validation loss of all language pairs. The data is tokenized with the SentencePiece tool (Kudo and Richardson, 2018) and we build a shared vocabulary of 32K tokens. For evaluation, we employ beam search decoding with a beam size of 5. BLEU scores are computed using detokenized case-sensitive SacreBLEU (signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1).
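For readers unfamiliar with the inverse square root schedule, the sketch below shows its common form: a linear warmup to the peak learning rate followed by decay proportional to the inverse square root of the step. The peak of 5e-4 comes from the text above; the warmup length of 4000 steps is a hypothetical value, since the warmup is not stated here, and the function is an illustration and not the fairseq implementation.

```python
import math


def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=4000):
    """Learning rate at a given update step (1-indexed).

    warmup_steps=4000 is an assumed value used only for illustration.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup_steps:
        # Linear warmup from near zero up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # After warmup, decay with the inverse square root of the step count.
    return peak_lr * math.sqrt(warmup_steps / step)


schedule = {step: inverse_sqrt_lr(step) for step in (1000, 4000, 16000, 100000)}
```

Under these assumptions the rate peaks at the end of warmup and halves each time the step count quadruples, so most of the 100K or 500K updates run at a fraction of the peak rate.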

C.2 Dataset Details

Table 8 lists the clean training corpora and evaluation datasets used for the simulated-noise experiments in Section 5.2.

Appendix D Chrf++ and Comet Scores

Tables 9, 10, and 11 show the Comet and Chrf++ scores for all experiments.

COMET

| Method | Variant | Misaligned-Laser 10% | Misaligned-Laser 30% | Misaligned-Laser 50% | Misaligned-Comet 10% | Misaligned-Comet 30% | Misaligned-Comet 50% | Raw-Crawl Data 10% | Raw-Crawl Data 30% | Raw-Crawl Data 50% |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline with noise | | 77.8 | 77.0 | 76.1 | 77.6 | 76.5 | 75.5 | 77.9 | 77.1 | 75.8 |
| Oracle w/o noise | | 79.5 | 79.0 | 78.6 | 79.5 | 79.0 | 78.6 | 79.5 | 79.0 | 78.6 |
| Pre-Filter | Laser | 78.0 | 76.9 | 75.6 | 78.2 | 78.0 | 76.0 | 78.0 | 77.8 | 76.9 |
| Pre-Filter | Comet | 77.9 | 77.5 | 76.3 | 77.5 | 76.3 | 74.0 | 78.0 | 76.8 | 75.6 |
| Truncation | loss-frac | 78.3 | 76.5 | 76.2 | 78.0 | 76.3 | 75.0 | 78.0 | 77.2 | 76.6 |
| Truncation | el2n-frac | 78.3 | 78.3 | 76.5 | 78.1 | 76.1 | 76.0 | 78.2 | 77.5 | 76.2 |
| Truncation | el2n-threshold | 78.5 | 78.7 | 76.7 | 78.1 | 76.6 | 75.4 | 78.5 | 77.9 | 76.6 |
| Self-Correction (Ours) | fixed $\tau=0.5$ | 79.0 | 78.5 | 76.8 | 78.5 | 77.6 | 76.2 | 78.8 | 78.1 | 76.5 |
| Self-Correction (Ours) | dynamic $\tau$ | 79.1 | 78.6 | 77.0 | 78.7 | 77.7 | 76.6 | 79.0 | 78.3 | 77.0 |

Chrf++

| Method | Variant | Misaligned-Laser 10% | Misaligned-Laser 30% | Misaligned-Laser 50% | Misaligned-Comet 10% | Misaligned-Comet 30% | Misaligned-Comet 50% | Raw-Crawl Data 10% | Raw-Crawl Data 30% | Raw-Crawl Data 50% |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline with noise | | 55.5 | 54.9 | 54.1 | 55.1 | 54.7 | 52.5 | 55.0 | 54.9 | 53.6 |
| Oracle w/o noise | | 57.2 | 56.9 | 55.5 | 57.2 | 56.9 | 55.5 | 57.2 | 56.9 | 55.5 |
| Pre-Filter | Laser | 56.5 | 54.5 | 53.4 | 56.3 | 56.0 | 52.6 | 55.2 | 55.0 | 54.3 |
| Pre-Filter | Comet | 56.0 | 54.2 | 53.0 | 55.0 | 54.2 | 51.9 | 55.2 | 54.2 | 52.8 |
| Truncation | loss-frac | 56.0 | 54.3 | 54.2 | 55.5 | 54.1 | 52.0 | 55.5 | 55.0 | 54.0 |
| Truncation | el2n-frac | 56.1 | 55.2 | 54.2 | 55.5 | 55.0 | 52.0 | 56.2 | 55.0 | 54.2 |
| Truncation | el2n-threshold | 56.5 | 56.3 | 54.6 | 56.2 | 55.0 | 52.8 | 56.2 | 55.5 | 54.0 |
| Self-Correction (Ours) | fixed $\tau=0.5$ | 56.8* | 56.5* | 54.3* | 56.6* | 55.2* | 52.8* | 56.6* | 55.5* | 54.0* |
| Self-Correction (Ours) | dynamic $\tau$ | 56.9* | 56.2* | 54.6* | 56.4* | 55.6* | 53.0* | 56.7* | 55.8* | 54.9* |

Table 9: Comet and Chrf++ scores of high-resource De→En translation task with different types of noise. The Comet score of full clean training corpus (5.8M) De→En is 80.0. The Chrf++ score of full clean training corpus (5.8M) De→En is 57.2.

COMET

| Method | Variant | Misaligned-Laser 10% | Misaligned-Laser 30% | Misaligned-Laser 50% | Misaligned-Comet 10% | Misaligned-Comet 30% | Misaligned-Comet 50% | Raw-Crawl Data 10% | Raw-Crawl Data 30% | Raw-Crawl Data 50% |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline with noise | | 79.8 | 79.0 | 77.8 | 79.7 | 75.9 | 71.6 | 79.7 | 79.5 | 78.3 |
| Oracle w/o noise | | 79.8 | 79.4 | 78.9 | 79.8 | 79.4 | 78.9 | 79.8 | 79.4 | 78.9 |
| Pre-Filter | Laser | 79.5 | 78.5 | 77.0 | 79.5 | 76.2 | 74.7 | 79.8 | 79.8 | 79.0 |
| Pre-Filter | Comet | 79.6 | 78.8 | 76.8 | 79.2 | 76.0 | 71.0 | 79.5 | 79.0 | 77.8 |
| Truncation | loss-frac | 79.9 | 78.4 | 78.0 | 79.0 | 75.6 | 71.2 | 79.8 | 79.4 | 78.6 |
| Truncation | el2n-frac | 80.1 | 79.1 | 78.2 | 79.8 | 76.2 | 72.3 | 79.9 | 79.5 | 78.8 |
| Truncation | el2n-threshold | 80.0 | 79.0 | 78.5 | 79.7 | 76.5 | 73.5 | 80.0 | 80.0 | 79.0 |
| Self-Correction (Ours) | fixed $\tau=0.5$ | 80.3 | 79.0 | 78.5 | 79.9 | 77.0 | 74.0 | 80.3 | 79.8 | 79.5 |
| Self-Correction (Ours) | dynamic $\tau$ | 80.1 | 79.2 | 78.8 | 79.9 | 77.1 | 74.6 | 80.2 | 80.1 | 79.2 |

Chrf++

| Method | Variant | Misaligned-Laser 10% | Misaligned-Laser 30% | Misaligned-Laser 50% | Misaligned-Comet 10% | Misaligned-Comet 30% | Misaligned-Comet 50% | Raw-Crawl Data 10% | Raw-Crawl Data 30% | Raw-Crawl Data 50% |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline with noise | | 35.7 | 34.0 | 33.0 | 34.9 | 30.1 | 24.2 | 35.6 | 34.0 | 32.7 |
| Oracle w/o noise | | 35.9 | 34.6 | 34.2 | 35.9 | 34.6 | 34.2 | 35.9 | 34.6 | 34.2 |
| Pre-Filter | Laser | 35.4 | 33.2 | 32.5 | 35.4 | 31.2 | 28.0 | 35.7 | 34.2 | 33.0 |
| Pre-Filter | Comet | 35.4 | 33.5 | 32.6 | 33.6 | 29.5 | 23.8 | 35.4 | 33.8 | 32.5 |
| Truncation | loss-frac | 35.8 | 33.6 | 33.2 | 35.3 | 30.2 | 25.8 | 35.7 | 34.2 | 32.8 |
| Truncation | el2n-frac | 35.6 | 34.1 | 33.3 | 35.6 | 30.4 | 26.0 | 35.6 | 34.1 | 32.8 |
| Truncation | el2n-threshold | 35.7 | 34.2 | 33.5 | 36.0 | 30.8 | 26.5 | 35.6 | 35.2 | 33.0 |
| Self-Correction (Ours) | fixed $\tau=0.5$ | 36.0* | 34.3* | 33.3* | 35.5* | 31.0* | 27.0* | 36.0* | 34.8* | 33.8* |
| Self-Correction (Ours) | dynamic $\tau$ | 35.8 | 34.4* | 33.6* | 35.8* | 31.5* | 28.3* | 35.8* | 35.0* | 33.4* |

Table 10: Comet and Chrf++ scores of low-resource En→Si translation task with different types of noise. The Comet score of full clean training corpus (0.9M) En→Si is 82.0. The Chrf++ score of full clean training corpus (0.9M) En→Si is 37.0.

| | en→fr♡ | en→ru♡ | en→tr | en→es | en→be | en→ht | en→si♡ | en→km♡ | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Misaligned Rate (%) | 10% | 19% | 44% | 22% | 10% | 35% | 62% | 18% | - |
| Corpus Size (M) | 5M | 5M | 5M | 5M | 1.1M | 0.55M | 0.21M | 0.06M | - |

COMET

| Method | Variant | en→fr♡ | en→ru♡ | en→tr | en→es | en→be | en→ht | en→si♡ | en→km♡ | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | | 80.0 | 79.2 | 82.0 | 76.5 | 68.3 | 63.2 | 59.6 | 73.6 | 72.8 |
| Pre-Filter | Laser | 81.0 | 81.0 | 81.3 | 76.7 | 67.4 | 63.2 | 59.7 | 73.6 | 73.0 |
| Pre-Filter | Comet | 80.5 | 79.1 | 81.0 | 76.0 | 68.9 | 63.0 | 59.5 | 73.2 | 72.7 |
| Truncation | loss-frac | 81.0 | 79.0 | 82.2 | 76.8 | 67.6 | 63.1 | 59.0 | 73.0 | 72.7 |
| Truncation | el2n-frac | 80.2 | 79.1 | 82.1 | 76.2 | 68.6 | 63.0 | 60.0 | 72.8 | 72.8 |
| Truncation | el2n-threshold | 81.4 | 81.3 | 82.3 | 76.8 | 68.7 | 63.3 | 60.7 | 73.0 | 73.4 |
| Self-Correction | fixed $\tau=0.5$ | 81.2 | 79.6 | 82.5 | 76.4 | 68.4 | 63.5 | 63.0 | 74.5 | 73.6 |
| Self-Correction | dynamic $\tau$ | 81.6 | 80.6 | 83.0 | 77.9 | 68.5 | 63.6 | 63.3 | 75.0 | 74.2 |

ChrF++

| Method | Variant | en→fr♡ | en→ru♡ | en→tr | en→es | en→be | en→ht | en→si♡ | en→km♡ | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | | 67.3 | 50.6 | 54.8 | 49.1 | 36.5 | 45.4 | 20.2 | 15.6 | 42.4 |
| Pre-Filter | Laser | 67.9 | 51.5 | 54.6 | 49.6 | 36.1 | 45.1 | 21.7 | 14.7 | 42.7 |
| Pre-Filter | Comet | 67.6 | 50.9 | 54.3 | 49.6 | 36.2 | 44.8 | 20.6 | 15.0 | 42.4 |
| Truncation | loss-frac | 67.4 | 50.4 | 55.2 | 49.2 | 36.3 | 45.3 | 20.0 | 13.4 | 42.2 |
| Truncation | el2n-frac | 67.6 | 50.6 | 55.2 | 49.5 | 36.5 | 45.2 | 20.6 | 13.0 | 42.3 |
| Truncation | el2n-threshold | 67.3 | 51.1 | 55.3 | 49.3 | 36.8 | 45.6 | 20.4 | 13.9 | 42.5 |
| Self-Correction | fixed $\tau=0.5$ | 68.0* | 50.9* | 54.9* | 49.6* | 36.8* | 45.6* | 24.0* | 16.5* | 43.2* |
| Self-Correction | dynamic $\tau$ | 68.2* | 51.3* | 55.4* | 50.0* | 37.2* | 46.2* | 22.2* | 16.8* | 43.4* |

Table 11: Comet and Chrf++ scores on real-world web-mined corpora. For pre-filter methods, we remove 20% of the training samples with the lowest scores. † denotes language pairs from CCAligned V1.0. ♡ denotes language pairs from ParaCrawl V7.1. The misaligned noise rate for different language pairs is reported from Kreutzer et al. (2022).
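
As a reading aid for the "Avg." column of Table 11, the short sketch below recomputes it for two COMET rows as an unweighted mean over the eight language pairs. That the column is a plain macro-average is an assumption (the text does not define it), although it reproduces the reported values for these rows. The dictionary keys and the helper name `macro_average` are illustrative.

```python
# COMET scores copied from Table 11, in column order:
# en-fr, en-ru, en-tr, en-es, en-be, en-ht, en-si, en-km
comet_scores = {
    "baseline": [80.0, 79.2, 82.0, 76.5, 68.3, 63.2, 59.6, 73.6],
    "self_correction_dynamic": [81.6, 80.6, 83.0, 77.9, 68.5, 63.6, 63.3, 75.0],
}


def macro_average(scores: list[float]) -> float:
    """Unweighted mean over language pairs, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


averages = {name: macro_average(vals) for name, vals in comet_scores.items()}
# Absolute COMET gain of dynamic self-correction over the baseline.
gain = round(averages["self_correction_dynamic"] - averages["baseline"], 1)

if __name__ == "__main__":
    for name, avg in averages.items():
        print(f"{name}: {avg}")
    print("gain:", gain)
```

Because the average is unweighted, the small en→si and en→km corpora count as much as the 5M-sentence pairs, so gains on the noisiest low-resource pairs contribute visibly to the overall figure.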