Investigating Multi-Pivot Ensembling
with Massively Multilingual Machine Translation Models

Alireza Mohammadshahi   1,2   Jannis Vamvas  1   Rico Sennrich 1
1 University of Zurich       2 EPFL
[email protected]
{vamvas,sennrich}@cl.uzh.ch
 Work done while working at University of Zurich. Currently co-founder of Leeroo.
Abstract

Massively multilingual machine translation models allow for the translation of a large number of languages with a single model, but have limited performance on low- and very-low-resource translation directions. Pivoting via high-resource languages remains a strong strategy for low-resource directions, and in this paper we revisit ways of pivoting through multiple languages. Previous work has used a simple averaging of probability distributions from multiple paths, but we find that this performs worse than using a single pivot, and exacerbates the hallucination problem because the same hallucinations can be probable across different paths. We also propose MaxEns, a novel combination strategy that makes the output biased towards the most confident predictions, hypothesising that confident predictions are less prone to be hallucinations. We evaluate different strategies on the FLORES benchmark for 20 low-resource language directions, demonstrating that MaxEns improves translation quality for low-resource languages while reducing hallucination in translations, compared to both direct translation and an averaging approach. On average, multi-pivot strategies still lag behind using English as a single pivot language, raising the question of how to identify the best pivoting strategy for a given translation direction. [Footnote 1: The implementation is publicly available at https://github.com/ZurichNLP/MultiPivotNMT.]

1 Introduction

Early work on multilingual neural machine translation (NMT) has explored combining source segments in different source languages Zoph and Knight (2016); Firat et al. (2016a), an idea that is also compatible with pivoting through intermediate languages. For example, one could translate from Dutch to Ukrainian by first translating the Dutch source to English and Russian, and then making a combined prediction to Ukrainian. In the simplest case, this combination is achieved by predicting probability distributions for each source language and averaging these predictions in an ensemble-like manner Firat et al. (2016a).

With massively multilingual NMT models NLLB Team et al. (2022); Mohammadshahi et al. (2022a); Goyal et al. (2022); Wenzek et al. (2021); Zhang et al. (2020); Fan et al. (2021); Aharoni et al. (2019); Arivazhagan et al. (2019), one can in principle translate directly in any translation direction. While early models relied on zero-shot generalization for many directions, recent improvements include massive data collection efforts Schwenk et al. (2021); El-Kishky et al. (2020) and synthetic data creation via back-translation Edunov et al. (2018); Sennrich et al. (2016). However, these models still have low performance on many low-resource translation directions [Footnote 2: The SentencePiece BLEU of 63% of the translation directions in M2M-100 is lower than 12 (Mohammadshahi et al., 2022b).], and pivot translation via high-resource languages remains a strong baseline. Fan et al. (2021) also investigate the combination of multiple translation paths, which they call multi-source self-ensemble, and which slightly improves over direct translation and a single pivot for zero-shot language pairs.

In this paper, we investigate this multi-source self-ensembling strategy more closely, with a focus on preventing completely defunct translations such as hallucinations. However, we find that simple averaging is sub-optimal and may increase the number of hallucinations in the output, a typical failure case in low-resource settings. We relate this to a recent finding that hallucinations are sticky, meaning that different models trained on the same data and architecture may produce similar hallucinations Guerreiro et al. (2023a). We also find evidence of such stickiness when combining multiple translation paths, and propose a new ensembling strategy that, instead of averaging probabilities, picks the output with the maximum probability across different paths: MaxEns. This is partially inspired by the finding that model confidence is a good heuristic for avoiding hallucinations, which tend to be low-confidence predictions Guerreiro et al. (2023b).

We perform experiments on the FLORES benchmark Goyal et al. (2022) for 20 low-resource translation directions by using two massively multilingual NMT models, SMaLL100 Mohammadshahi et al. (2022a) and M2M100 Fan et al. (2021). Our results show that while the average ensemble outperforms direct translation, it still underperforms using only English as a pivot, both in terms of spBLEU and the number of hallucinations. MaxEns performs significantly better than the averaging strategy for both translation performance and hallucination. Specifically, MaxEns has competitive translation performance with English pivoting on average, but still lags behind it in terms of hallucinations. To sum up, our contributions are:

  • We explore why a naive multi-pivot strategy with massively multilingual models can underperform single-pivot translation. Then, we propose MaxEns, a more robust ensembling technique for multi-pivot translation with multilingual NMT models.

  • We evaluate different ensembling strategies on 20 low-resource translation directions of the FLORES benchmark, and demonstrate that multi-pivot ensembling still lags behind English pivoting.

2 Related Work

Several approaches have explored multi-pivot methods to improve the performance of NMT models, specifically for low-resource language directions Macháček et al. (2023); Dabre et al. (2021); Kim et al. (2019); Cheng et al. (2017); Firat et al. (2016b). Macháček et al. (2023) analyzed the robustness of multi-source NMT to transcription errors. Dabre et al. (2021) improved the performance of simultaneous NMT by translating the source language into pivot languages, then applying the multi-source translation method Zoph and Knight (2016). Firat et al. (2016b) proposed a novel zero-resource translation approach by exploiting the multi-way multilingual NMT model introduced by Firat et al. (2016a), and improved the performance over traditional pivot-based translation Wu and Wang (2007); Utiyama and Isahara (2007). Cheng et al. (2017) introduced a pivot-based NMT model that jointly trains the source-to-pivot and pivot-to-target directions. Currey and Heafield (2019) proposed an alternative method that uses monolingual pivot-language data for zero-resource NMT via back-translation Sennrich et al. (2016).

3 Ensembling Methods

When performing direct translation, the score of a translation $Y$ given a source sequence $X_{src}$ is computed as follows:

$$s(Y; X_{src}) = \sum_{i=1}^{|Y|} \log p(y_i \mid y_{<i}, X_{src}) \tag{1}$$

where $p(y_i \mid y_{<i}, X_{src})$ is the predicted probability of the $i$-th target token $y_i$ given the previous tokens $y_{<i}$ and the source sequence $X_{src}$.
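As a minimal illustration of Eq. (1), the sketch below sums token log-probabilities for a single hypothesis. The function name `sequence_score` and the probability values are invented for illustration; in practice the per-token probabilities would come from the NMT model's decoder.

```python
import math

def sequence_score(token_probs):
    """Return s(Y; X_src): the sum of log p(y_i | y_<i, X_src) over target tokens."""
    return sum(math.log(p) for p in token_probs)

# Toy per-token probabilities for a three-token translation (invented values).
toy_probs = [0.5, 0.25, 0.8]
score = sequence_score(toy_probs)
print(round(score, 4))
```

Because the score is a sum of logarithms, it equals the logarithm of the product of the token probabilities (here, log 0.1).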

For multi-pivot ensembling, we select a set of pivot languages $M = \{\mu_1, \mu_2, \ldots, \mu_K\}$ and generate the corresponding pivot translations $X_M = \{X_{\mu_1}, X_{\mu_2}, \ldots, X_{\mu_K}\}$. The final translation is generated by ensembling predictions, conditioned on the individual pivot translations.

In the following, we describe two approaches for such an ensembling: the multilingual averaging method and our MaxEns approach.

Multilingual Average (MultiAvg).

Inspired by Fan et al. (2021); Firat et al. (2016b), we average the predicted probabilities of a token $y_i$ across all pivot languages: [Footnote 3: We tried both averaging probabilities and log-probabilities in preliminary experiments, and averaging probabilities worked better in terms of translation performance and hallucination.]

$$s(Y; X_M) = \sum_{i=1}^{|Y|} \log \frac{1}{|M|} \sum_{k=1}^{|M|} p(y_i \mid y_{<i}, X_{\mu_k}) \tag{2}$$

where $|Y|$ and $|M|$ are the number of target tokens and pivots, respectively.
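The averaging in Eq. (2) can be sketched as follows. This is an illustrative toy, not the released implementation: `multiavg_score` is a hypothetical helper, and the probabilities are invented rather than produced by a model.

```python
import math

def multiavg_score(probs_per_pivot):
    """Score a hypothesis as in Eq. (2).

    probs_per_pivot[k][i] is p(y_i | y_<i, X_mu_k): the probability assigned
    to target token i when the model is conditioned on pivot translation k.
    """
    num_pivots = len(probs_per_pivot)
    total = 0.0
    # zip(*...) groups the probabilities of the same target token across pivots.
    for probs_for_token in zip(*probs_per_pivot):
        total += math.log(sum(probs_for_token) / num_pivots)
    return total

# Invented probabilities for a two-token hypothesis under three pivots.
toy = [
    [0.2, 0.6],  # pivot 1
    [0.1, 0.9],  # pivot 2
    [0.9, 0.3],  # pivot 3
]
multiavg = multiavg_score(toy)
print(round(multiavg, 4))
```

Note that the average is taken over probabilities before the logarithm is applied, which matches the variant that footnote 3 reports as working better than averaging log-probabilities.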

Figure 1: A sample translation of Afrikaans to Asturian by using SMaLL100. Translations of individual pivots are shown in the middle, and output translations of MaxEns and MultiAvg on the right. The MaxEns method eliminates the hallucination, as it follows the more confident pivot (here, Spanish). Glosses in English are presented within gray boxes.

Maximum Ensemble (MaxEns).

As a novel combination strategy that biases the prediction towards the more confident pivot, we propose the following approach:

$$s(Y; X_M) = \sum_{i=1}^{|Y|} \max_{k=1}^{|M|} \left[ \log p(y_i \mid y_{<i}, X_{\mu_k}) \right] \tag{3}$$

That is, for each token $y_i$, the score is the maximum log-probability over the pivot-conditioned predictions. Intuitively, MaxEns selects the most confident pivot language when generating token $y_i$.
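To make the contrast between Eq. (2) and Eq. (3) concrete, the following sketch scores a hypothesis with the maximum rule and compares the next-token choice under averaging and under the maximum. All names (`maxens_score`, `pick_token`, `mean`) and all numbers are invented for illustration; the token labels "A", "B", "C" stand in for vocabulary entries.

```python
import math

def maxens_score(probs_per_pivot):
    """Score a hypothesis as in Eq. (3): per token, keep the largest log-probability."""
    return sum(math.log(max(probs_for_token))
               for probs_for_token in zip(*probs_per_pivot))

def pick_token(dists, combine):
    """Pick the next token after combining per-pivot probabilities token by token."""
    vocab = dists[0].keys()
    combined = {tok: combine([d[tok] for d in dists]) for tok in vocab}
    return max(combined, key=combined.get)

def mean(values):
    return sum(values) / len(values)

# Invented next-token distributions: two pivots moderately prefer the same
# (hypothetically hallucinated) token "A"; the third is confident in "B".
dists = [
    {"A": 0.60, "B": 0.10, "C": 0.30},
    {"A": 0.60, "B": 0.10, "C": 0.30},
    {"A": 0.05, "B": 0.70, "C": 0.25},
]
avg_choice = pick_token(dists, mean)  # averaging, as in MultiAvg
max_choice = pick_token(dists, max)   # maximum, as in MaxEns
print(avg_choice, max_choice)
```

In this toy case, averaging follows the two agreeing pivots, while the maximum follows the single most confident pivot. The per-token maxima do not sum to one across the vocabulary, so the combined values are scores rather than a normalized distribution.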

4 Results and Discussion

4.1 Experimental Setup

Models.

We used M2M100 and SMaLL100 as our massively multilingual NMT models. M2M100 is trained on large-scale multilingual corpora Schwenk et al. (2021); El-Kishky et al. (2020) with a novel data mining procedure that uses language similarities. We use the M2M100 variant with 418M parameters. SMaLL100 Mohammadshahi et al. (2022a) is a distilled version of M2M100 12B with 330M parameters. It has been trained with uniform sampling across all language pairs on nearly 6% of the M2M100 pre-training dataset, and achieves competitive performance with the 1.2B-parameter M2M100.

Evaluation Setting.

Inspired by Fan et al. (2021), we use the FLORES-101 benchmark Goyal et al. (2022). It contains 3,001 sentences derived from English Wikipedia and translated into 101 languages by human translators. We use the devtest subset for the evaluation. To better understand the effect of multilingual pivoting, we chose five low-resource (or very-low-resource) languages from different branches of Indo-European: Germanic, Romance, Slavic, Indo-Aryan, and Iranian. These languages are Afrikaans, Asturian, Croatian, Urdu, and Pashto. We evaluate on all ordered pairs of these languages, which results in 20 translation directions. As pivot languages, we use English, Spanish and French. English has the largest amount of bitext overall in the training data of M2M100, and Spanish and French have the largest amount of bitext with English Fan et al. (2021).
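The count of 20 directions follows from taking every ordered pair of the five languages (5 × 4 = 20). A short sketch of the enumeration, with variable names chosen here for illustration:

```python
from itertools import permutations

languages = ["Afrikaans", "Asturian", "Croatian", "Urdu", "Pashto"]
pivots = ["English", "Spanish", "French"]

# Every ordered (source, target) pair with source != target.
pairs = list(permutations(languages, 2))
print(len(pairs))

for source, target in pairs[:2]:
    # Each direction can be translated directly or via any of the pivots.
    print(f"{source} -> {target} (pivots: {', '.join(pivots)})")
```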

Evaluation Metrics.

spBLEU [Footnote 4: BLEU is computed after tokenization with SentencePiece with 256K tokens Goyal et al. (2022).] is used to measure the translation performance Goyal et al. (2022). For the hallucination measurement, we apply a coarse estimation method inspired by Lee et al. (2019); Müller and Sennrich (2021), counting the proportion of sentences with ChrF Popović (2015) [Footnote 5: sacrebleu 2.3.1 Post (2018) with ChrF3 is used.] less than 20 [Footnote 6: Threshold based on manual inspection.]. Additionally, we use top n-gram (TNG) Guerreiro et al. (2023b); Raunak et al. (2022, 2021) for detecting oscillatory hallucinations [Footnote 7: We follow Guerreiro et al. (2023b) and use $n=4$ and $t=2$.]. We apply significance testing with $p=0.05$ [Footnote 8: Paired bootstrap resampling Koehn (2004) with sacrebleu.]. Beam size 5 is used for inference.
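The coarse hallucination estimate described above reduces to a threshold on sentence-level ChrF scores. The sketch below assumes the scores have already been computed (for example with sacrebleu) and are on a 0 to 100 scale; `hallucination_rate` and the sample scores are hypothetical.

```python
def hallucination_rate(chrf_scores, threshold=20.0):
    """Percentage of sentences whose ChrF score is strictly below the threshold."""
    if not chrf_scores:
        raise ValueError("chrf_scores must not be empty")
    flagged = sum(1 for score in chrf_scores if score < threshold)
    return 100.0 * flagged / len(chrf_scores)

# Invented sentence-level ChrF scores for five translations.
toy_scores = [10.0, 25.0, 19.9, 20.0, 50.0]
rate = hallucination_rate(toy_scores)
print(rate)
```

A score of exactly 20 is not counted, since the criterion is "less than 20".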

| Language Pairs | Direct | MultiAvg | MaxEns | EN Pivot |
|---|---|---|---|---|
| Afrikaans-Asturian | 20.6 | 21.8 | **22.5** | 21.0 |
| Afrikaans-Croatian | 22.1 | 22.5 | 22.4 | **22.8** |
| Afrikaans-Urdu | 14.0 | 13.8 | 13.9 | **14.4** |
| Afrikaans-Pashto | **5.9** | **5.8** | **6.0** | **6.0** |
| Asturian-Afrikaans | 18.5 | 19.9 | 19.8 | **21.0** |
| Asturian-Croatian | 16.1 | **20.4** | 20.0 | **20.4** |
| Asturian-Urdu | 8.4 | 11.8 | 11.8 | **12.6** |
| Asturian-Pashto | 3.6 | 5.0 | 5.0 | **5.4** |
| Croatian-Afrikaans | 20.5 | 20.7 | 20.8 | **21.2** |
| Croatian-Asturian | 19.7 | 20.8 | **21.6** | 19.7 |
| Croatian-Urdu | **13.5** | 13.1 | 12.9 | 13.2 |
| Croatian-Pashto | 5.0 | 5.2 | 5.2 | **5.6** |
| Urdu-Afrikaans | 12.1 | 13.0 | 13.2 | **13.8** |
| Urdu-Asturian | 7.7 | 11.7 | **12.8** | 12.2 |
| Urdu-Croatian | 11.2 | **12.0** | **11.9** | **12.2** |
| Urdu-Pashto | **4.7** | **4.4** | 4.2 | **4.6** |
| Pashto-Afrikaans | 10.2 | 11.0 | 11.0 | **11.4** |
| Pashto-Asturian | 6.9 | 9.9 | **10.9** | 10.4 |
| Pashto-Croatian | 8.7 | **9.9** | **10.0** | **9.8** |
| Pashto-Urdu | **10.0** | 9.2 | 9.2 | 9.5 |
| Average | 12.0 | 13.1 | **13.3** | **13.4** |

Table 1: Average spBLEU (higher is better) of different pivoting methods for M2M100 and SMaLL100 on selected language pairs of FLORES-101. Best systems (not significantly outperformed by any other) in bold.

| Language Pairs | Direct | MultiAvg | MaxEns | EN Pivot |
|---|---|---|---|---|
| Afrikaans-Asturian | 5.9 | 6.4 | 4.0 | 4.3 |
| Afrikaans-Croatian | 1.7 | 2.2 | 2.2 | 1.9 |
| Afrikaans-Urdu | 11.3 | 13.9 | 13.4 | 11.1 |
| Afrikaans-Pashto | 49.5 | 54.1 | 53.2 | 49.8 |
| Asturian-Afrikaans | 8.8 | 6.2 | 7.8 | 2.1 |
| Asturian-Croatian | 19.5 | 5.3 | 8.2 | 3.5 |
| Asturian-Urdu | 40.7 | 23.4 | 22.9 | 18.0 |
| Asturian-Pashto | 66.7 | 62.2 | 62.2 | 55.9 |
| Croatian-Afrikaans | 1.1 | 1.5 | 1.2 | 1.3 |
| Croatian-Asturian | 6.1 | 6.3 | 3.8 | 5.0 |
| Croatian-Urdu | 13.2 | 16.5 | 16.3 | 12.7 |
| Croatian-Pashto | 54.7 | 58.4 | 56.6 | 53.1 |
| Urdu-Afrikaans | 9.8 | 9.2 | 9.4 | 4.9 |
| Urdu-Asturian | 31.5 | 19.9 | 14.7 | 12.5 |
| Urdu-Croatian | 14.1 | 15.4 | 15.5 | 11.8 |
| Urdu-Pashto | 56.8 | 66.0 | 66.8 | 60.1 |
| Pashto-Afrikaans | 9.1 | 10.5 | 10.0 | 6.6 |
| Pashto-Asturian | 26.9 | 21.2 | 15.7 | 15.6 |
| Pashto-Croatian | 18.4 | 18.5 | 18.7 | 16.7 |
| Pashto-Urdu | 24.7 | 33.2 | 33.3 | 28.9 |
| Average | 23.5 | 22.5 | 21.8 | 18.8 |

Table 2: Average percentage (%) of hallucinations (chrF < 20; lower is better) of different pivoting methods for M2M100 and SMaLL100 on selected language pairs of FLORES-101.

4.2 Results & Discussion

Figure 1 illustrates an example of multi-pivot translation of Afrikaans to Asturian by using English, French, and Spanish as pivots. Translations via English and French pivots are hallucinations, while the translation via the Spanish pivot is more related to the reference translation. The output of the MaxEns approach is closer to the translation obtained by using Spanish as the pivot language, since the NMT model is more confident for this pivot (the maxima of the output probability distributions for the first token with the English, French, and Spanish pivots are 0.15, 0.14, and 0.91, respectively). In contrast, the output of the MultiAvg method is a hallucination.
Tables 1 and 2 show translation and hallucination performances on 20 language directions, respectively [Footnote 9: Average scores of M2M100 and SMaLL100 are shown in Tables 1 and 2. Individual scores are provided in Appendix A. TNG scores for measuring oscillatory hallucinations are provided in Appendix B.]. The MultiAvg approach achieves better translation performance and fewer hallucinations than direct translation. However, MultiAvg lags behind the English pivoting approach in terms of both translation quality (13.4 vs. 13.1 spBLEU for English pivoting and MultiAvg, respectively) and the occurrence of hallucinations (18.8% vs. 22.5%) [Footnote 10: 4.5% vs. 7.3% based on the TNG metric, as shown in Table 7.].
Applying the MaxEns method instead for combining pivots narrows this gap: it improves translation quality and reduces the occurrence of hallucinations. Specifically, MaxEns reaches competitive translation quality with English pivoting on average (13.3 vs. 13.4), while still underperforming on hallucinations, as most of the parallel sentences in the pre-training data of M2M100 and SMaLL100 are paired with English.

In general, the optimal strategy differs across translation directions, highlighting the potential for future research on determining the most effective translation strategy for each direction without depending on the development data for each.
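To illustrate how the best strategy varies by direction, the sketch below counts, for each method, the number of directions in which it reaches the highest spBLEU. The values are copied from Table 1; the comparison uses the raw numeric maximum only and ignores the significance testing that determines the bold entries, so it is a rough illustration rather than a restatement of the paper's analysis.

```python
from collections import Counter

METHODS = ("Direct", "MultiAvg", "MaxEns", "EN Pivot")

# spBLEU per direction, copied from Table 1 (same column order as METHODS).
SPBLEU = {
    "Afrikaans-Asturian": (20.6, 21.8, 22.5, 21.0),
    "Afrikaans-Croatian": (22.1, 22.5, 22.4, 22.8),
    "Afrikaans-Urdu": (14.0, 13.8, 13.9, 14.4),
    "Afrikaans-Pashto": (5.9, 5.8, 6.0, 6.0),
    "Asturian-Afrikaans": (18.5, 19.9, 19.8, 21.0),
    "Asturian-Croatian": (16.1, 20.4, 20.0, 20.4),
    "Asturian-Urdu": (8.4, 11.8, 11.8, 12.6),
    "Asturian-Pashto": (3.6, 5.0, 5.0, 5.4),
    "Croatian-Afrikaans": (20.5, 20.7, 20.8, 21.2),
    "Croatian-Asturian": (19.7, 20.8, 21.6, 19.7),
    "Croatian-Urdu": (13.5, 13.1, 12.9, 13.2),
    "Croatian-Pashto": (5.0, 5.2, 5.2, 5.6),
    "Urdu-Afrikaans": (12.1, 13.0, 13.2, 13.8),
    "Urdu-Asturian": (7.7, 11.7, 12.8, 12.2),
    "Urdu-Croatian": (11.2, 12.0, 11.9, 12.2),
    "Urdu-Pashto": (4.7, 4.4, 4.2, 4.6),
    "Pashto-Afrikaans": (10.2, 11.0, 11.0, 11.4),
    "Pashto-Asturian": (6.9, 9.9, 10.9, 10.4),
    "Pashto-Croatian": (8.7, 9.9, 10.0, 9.8),
    "Pashto-Urdu": (10.0, 9.2, 9.2, 9.5),
}

def best_methods(scores):
    """Return every method that attains the highest score (ties are all kept)."""
    top = max(scores)
    return [method for method, s in zip(METHODS, scores) if s == top]

wins = Counter(m for scores in SPBLEU.values() for m in best_methods(scores))
for method in METHODS:
    print(method, wins[method])
```

With ties counted for every tied method, English pivoting reaches the top score in 12 directions, MaxEns in 6, direct translation in 3, and MultiAvg in 1.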

5 Conclusion

We investigate more closely the multi-source self-ensembling method of Fan et al. (2021) for combining multiple translation paths to improve translations of low-resource (or very-low-resource) language pairs. Specifically, this approach (named MultiAvg here) averages the predicted probability distributions of each source language in an ensemble-like manner. We evaluated it on 20 low-resource language pairs of the FLORES-101 benchmark by using two massively multilingual NMT models, SMaLL100 and M2M100. The MultiAvg method performs better than direct translation in terms of both translation quality and hallucinations; however, it lags behind using only English as a pivot. We then proposed MaxEns, a novel combination method that chooses the maximum of the pivots' prediction probabilities for each target token. This approach results in better translation quality compared to MultiAvg, while reducing hallucinations. On average, it achieves competitive performance with English pivoting with regard to the translation quality metric, but performs worse with regard to the hallucination metric. The most effective translation strategy varies depending on the translation direction, suggesting the need for future research to identify the optimal strategy for each direction independently of the specific development data. We hope our findings are a starting point for the broader integration of ensemble techniques within the context of massively multilingual NMT.

The insights from our experiments, specifically the stickiness of hallucinations across different inputs, have inspired our follow-up work on source-contrastive decoding Sennrich et al. (2024), which has been shown empirically to be an effective strategy for mitigating hallucinations. Future work could revisit multi-pivot ensembling in combination with source-contrastive decoding.

Limitations

We apply our method to two common massively multilingual NMT models, SMaLL100 and M2M100; future work can extend our work to more recent large models, e.g., NLLB200 NLLB Team et al. (2022) and LLMs Touvron et al. (2023); Workshop et al. (2022). We tested our approach on a subset of 20 low-resource language directions; future research can study the method for further language directions, including medium-resource language pairs.

Acknowledgement

This work was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727/213976).

References

  • Aharoni et al. (2019) Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Arivazhagan et al. (2019) Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges.
  • Cheng et al. (2017) Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Wei Xu. 2017. Joint training for pivot-based neural machine translation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3974–3980.
  • Currey and Heafield (2019) Anna Currey and Kenneth Heafield. 2019. Zero-resource neural machine translation with monolingual pivot data. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 99–107, Hong Kong. Association for Computational Linguistics.
  • Dabre et al. (2021) Raj Dabre, Aizhan Imankulova, Masahiro Kaneko, and Abhisek Chakrabarty. 2021. Simultaneous multi-pivot neural machine translation.
  • Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.
  • El-Kishky et al. (2020) Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, and Philipp Koehn. 2020. CCAligned: A massive collection of cross-lingual web-document pairs. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5960–5969, Online. Association for Computational Linguistics.
  • Fan et al. (2021) Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Michael Auli, and Armand Joulin. 2021. Beyond english-centric multilingual machine translation. Journal of Machine Learning Research, 22(107):1–48.
  • Firat et al. (2016a) Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California. Association for Computational Linguistics.
  • Firat et al. (2016b) Orhan Firat, Baskaran Sankaran, Yaser Al-onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 268–277, Austin, Texas. Association for Computational Linguistics.
  • Goyal et al. (2022) Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.
  • Guerreiro et al. (2023a) Nuno M. Guerreiro, Duarte M. Alves, Jonas Waldendorf, Barry Haddow, Alexandra Birch, Pierre Colombo, and André F. T. Martins. 2023a. Hallucinations in Large Multilingual Translation Models. Transactions of the Association for Computational Linguistics, 11:1500–1517.
  • Guerreiro et al. (2023b) Nuno M. Guerreiro, Elena Voita, and André Martins. 2023b. Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1059–1075, Dubrovnik, Croatia. Association for Computational Linguistics.
  • Kim et al. (2019) Yunsu Kim, Petre Petrov, Pavel Petrushkov, Shahram Khadivi, and Hermann Ney. 2019. Pivot-based transfer learning for neural machine translation between non-English languages. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 866–876, Hong Kong, China. Association for Computational Linguistics.
  • Koehn (2004) Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.
  • Lee et al. (2019) Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. 2019. Hallucinations in neural machine translation.
  • Macháček et al. (2023) Dominik Macháček, Peter Polak, Ondřej Bojar, and Raj Dabre. 2023. Robustness of multi-source MT to transcription errors. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3707–3723, Toronto, Canada. Association for Computational Linguistics.
  • Mohammadshahi et al. (2022a) Alireza Mohammadshahi, Vassilina Nikoulina, Alexandre Berard, Caroline Brun, James Henderson, and Laurent Besacier. 2022a. SMaLL-100: Introducing shallow multilingual machine translation model for low-resource languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8348–8359, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Mohammadshahi et al. (2022b) Alireza Mohammadshahi, Vassilina Nikoulina, Alexandre Berard, Caroline Brun, James Henderson, and Laurent Besacier. 2022b. What do compressed multilingual machine translation models forget? In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4308–4329, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Müller and Sennrich (2021) Mathias Müller and Rico Sennrich. 2021. Understanding the properties of minimum Bayes risk decoding in neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 259–272, Online. Association for Computational Linguistics.
  • NLLB Team et al. (2022) NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation.
  • Popović (2015) Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
  • Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
  • Raunak et al. (2021) Vikas Raunak, Arul Menezes, and Marcin Junczys-Dowmunt. 2021. The curious case of hallucinations in neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1172–1183, Online. Association for Computational Linguistics.
  • Raunak et al. (2022) Vikas Raunak, Matt Post, and Arul Menezes. 2022. SALTED: A framework for SAlient long-tail translation error detection. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5163–5179, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Schwenk et al. (2021) Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. 2021. CCMatrix: Mining billions of high-quality parallel sentences on the web. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6490–6500, Online. Association for Computational Linguistics.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.
  • Sennrich et al. (2024) Rico Sennrich, Jannis Vamvas, and Alireza Mohammadshahi. 2024. Mitigating hallucinations and off-target machine translation with source-contrastive and language-contrastive decoding. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 21–33, St. Julian’s, Malta. Association for Computational Linguistics.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models.
  • Utiyama and Isahara (2007) Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 484–491, Rochester, New York. Association for Computational Linguistics.
  • Wenzek et al. (2021) Guillaume Wenzek, Vishrav Chaudhary, Angela Fan, Sahir Gomez, Naman Goyal, Somya Jain, Douwe Kiela, Tristan Thrush, and Francisco Guzmán. 2021. Findings of the WMT 2021 shared task on large-scale multilingual machine translation. In Proceedings of the Sixth Conference on Machine Translation, pages 89–99, Online. Association for Computational Linguistics.
  • Workshop et al. (2022) BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro Von Werra, Leon Weber, Long Phan, Loubna Ben allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Davut Emre Taşar, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-shaibani, Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shachar Mirkin, Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh HajiHosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Ferrandis, Daniel McDuff, Danish Contractor, David Lansky, Davis David, Douwe Kiela, Duong A. Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis Sanz, Livia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo Wang, Caio Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel De Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada, and Thomas Wolf. 2022. BLOOM: A 176B-parameter open-access multilingual language model.
  • Wu and Wang (2007) Hua Wu and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 856–863, Prague, Czech Republic. Association for Computational Linguistics.
  • Zhang et al. (2020) Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639, Online. Association for Computational Linguistics.
  • Zoph and Knight (2016) Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 30–34, San Diego, California. Association for Computational Linguistics.

Appendix A Individual Scores of M2M100 and SMaLL100 Models

A.1 M2M100 Results

| Language Pairs | Direct | MultiAvg | MaxEns | EN Pivot |
|---|---|---|---|---|
| Afrikaans-Asturian | 19.3 | 20.2 | **21.0** | 20.2 |
| Afrikaans-Croatian | **20.8** | **21.1** | **21.0** | **21.4** |
| Afrikaans-Urdu | 14.0 | 13.6 | 13.8 | **14.4** |
| Afrikaans-Pashto | **5.4** | **5.4** | **5.5** | **5.6** |
| Asturian-Afrikaans | 14.2 | 16.5 | 16.1 | **18.0** |
| Asturian-Croatian | 11.1 | **19.0** | 18.3 | **19.4** |
| Asturian-Urdu | 6.3 | 11.4 | 11.4 | **12.6** |
| Asturian-Pashto | 2.4 | 4.4 | 4.5 | **5.0** |
| Croatian-Afrikaans | 17.6 | **17.9** | **17.9** | **18.2** |
| Croatian-Asturian | 18.8 | 19.5 | **20.6** | 19.1 |
| Croatian-Urdu | **13.6** | 13.3 | 13.1 | **13.4** |
| Croatian-Pashto | 4.4 | 4.9 | 4.9 | **5.4** |
| Urdu-Afrikaans | 9.0 | 9.8 | 10.0 | **10.6** |
| Urdu-Asturian | 7.1 | 9.9 | **10.8** | **10.9** |
| Urdu-Croatian | 8.9 | **10.0** | **9.8** | **10.1** |
| Urdu-Pashto | **4.2** | 3.8 | 3.6 | **4.2** |
| Pashto-Afrikaans | 8.3 | **9.3** | 8.9 | **9.3** |
| Pashto-Asturian | 7.8 | 9.6 | **10.4** | 9.7 |
| Pashto-Croatian | 8.0 | **9.0** | **9.0** | 8.5 |
| Pashto-Urdu | **9.8** | 8.9 | 8.9 | 9.0 |
| Average | 10.6 | 11.9 | 12.0 | **12.2** |

Table 3: spBLEU (higher is better) of different pivoting methods for the M2M100 model on selected language pairs of the FLORES-101 benchmark Goyal et al. (2022). Best systems (not significantly outperformed by any other) in bold.

| Language Pairs | Direct | MultiAvg | MaxEns | EN Pivot |
|---|---|---|---|---|
| Afrikaans-Asturian | 7.0 | 8.8 | 5.0 | 3.6 |
| Afrikaans-Croatian | 2.6 | 2.6 | 2.4 | 2.3 |
| Afrikaans-Urdu | 12.2 | 15.5 | 14.5 | 11.5 |
| Afrikaans-Pashto | 53.6 | 59.7 | 57.0 | 54.8 |
| Asturian-Afrikaans | 15.7 | 10.0 | 13.0 | 2.8 |
| Asturian-Croatian | 35.0 | 7.3 | 12.6 | 4.4 |
| Asturian-Urdu | 55.8 | 26.7 | 27.3 | 18.4 |
| Asturian-Pashto | 73.4 | 68.0 | 67.8 | 58.6 |
| Croatian-Afrikaans | 1.4 | 1.9 | 1.5 | 1.7 |
| Croatian-Asturian | 6.5 | 8.4 | 4.5 | 3.9 |
| Croatian-Urdu | 13.3 | 17.0 | 16.4 | 13.6 |
| Croatian-Pashto | 59.1 | 62.0 | 58.8 | 57.5 |
| Urdu-Afrikaans | 15.2 | 14.5 | 14.8 | 7.7 |
| Urdu-Asturian | 38.9 | 28.0 | 21.5 | 15.9 |
| Urdu-Croatian | 20.3 | 22.3 | 22.2 | 17.9 |
| Urdu-Pashto | 60.2 | 72.7 | 73.3 | 65.1 |
| Pashto-Afrikaans | 11.5 | 12.8 | 12.7 | 9.0 |
| Pashto-Asturian | 23.1 | 25.1 | 17.9 | 18.0 |
| Pashto-Croatian | 20.5 | 21.6 | 22.5 | 21.1 |
| Pashto-Urdu | 26.6 | 35.9 | 35.5 | 31.9 |
| Average | 27.6 | 26.0 | 25.0 | 21.0 |

Table 4: The percentage of hallucinations (chrF < 20; lower is better) of different pivoting methods for the M2M100 model on selected language pairs of the FLORES-101 benchmark Goyal et al. (2022).
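
The hallucination percentage reported in these tables can be illustrated with a minimal sketch. It assumes that chrF is computed per segment on a 0-100 scale and that a segment counts as a hallucination when its score is strictly below 20, as the caption states; the function name and the example scores are hypothetical and are not taken from the paper's implementation.

```python
def hallucination_rate(chrf_scores, threshold=20.0):
    """Return the percentage of segments whose chrF is strictly below threshold.

    Assumption: chrf_scores holds sentence-level chrF values on a 0-100 scale.
    """
    if not chrf_scores:
        raise ValueError("need at least one segment score")
    flagged = sum(1 for score in chrf_scores if score < threshold)
    return 100.0 * flagged / len(chrf_scores)


# Hypothetical sentence-level chrF scores (not real system output)
example_scores = [54.2, 12.7, 19.9, 20.0, 71.3]
print(f"{hallucination_rate(example_scores):.1f}% of segments flagged")
```

A score of exactly 20 is not flagged in this sketch, because the caption uses a strict inequality.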

A.2 SMaLL100 Results

| Language Pairs | Direct | MultiAvg | MaxEns | EN Pivot |
|---|---|---|---|---|
| Afrikaans-Asturian | 22.0 | 23.4 | **24.0** | 21.7 |
| Afrikaans-Croatian | 23.5 | **23.9** | **23.8** | **24.1** |
| Afrikaans-Urdu | 13.9 | 14.0 | 14.0 | **14.4** |
| Afrikaans-Pashto | **6.4** | 6.1 | **6.4** | **6.4** |
| Asturian-Afrikaans | 22.8 | **23.3** | **23.4** | **23.8** |
| Asturian-Croatian | 21.1 | **21.8** | **21.6** | **21.4** |
| Asturian-Urdu | 10.5 | 12.1 | 12.2 | **12.5** |
| Asturian-Pashto | 4.8 | **5.6** | **5.5** | **5.7** |
| Croatian-Afrikaans | 23.4 | 23.4 | 23.7 | **24.2** |
| Croatian-Asturian | 20.6 | **22.1** | **22.5** | 20.2 |
| Croatian-Urdu | **13.3** | 12.8 | 12.6 | **13.0** |
| Croatian-Pashto | **5.6** | **5.4** | **5.4** | **5.8** |
| Urdu-Afrikaans | 15.1 | 16.1 | 16.3 | **17.0** |
| Urdu-Asturian | 8.3 | 13.4 | **14.8** | 13.4 |
| Urdu-Croatian | 13.4 | 14.0 | 13.9 | 14.3 |
| Urdu-Pashto | **5.1** | **5.0** | **4.8** | **5.0** |
| Pashto-Afrikaans | 12.0 | 12.7 | 12.9 | **13.5** |
| Pashto-Asturian | 6.0 | 10.2 | **11.3** | **11.1** |
| Pashto-Croatian | 9.4 | **10.8** | **11.0** | **11.0** |
| Pashto-Urdu | **10.2** | 9.5 | 9.4 | 9.9 |
| Average | 13.4 | 14.3 | **14.5** | **14.4** |

Table 5: spBLEU (higher is better) of different pivoting methods for the SMaLL100 model on selected language pairs of the FLORES-101 benchmark Goyal et al. (2022). Best systems (not significantly outperformed by any other) in bold.

| Language Pairs | Direct | MultiAvg | MaxEns | EN Pivot |
|---|---|---|---|---|
| Afrikaans-Asturian | 4.7 | 4.1 | 2.9 | 5.0 |
| Afrikaans-Croatian | 0.9 | 1.9 | 2.1 | 1.5 |
| Afrikaans-Urdu | 10.4 | 12.3 | 12.2 | 10.8 |
| Afrikaans-Pashto | 45.4 | 48.4 | 49.4 | 44.8 |
| Asturian-Afrikaans | 1.9 | 2.5 | 2.6 | 1.5 |
| Asturian-Croatian | 4.1 | 3.3 | 3.9 | 2.7 |
| Asturian-Urdu | 25.6 | 20.1 | 18.5 | 17.5 |
| Asturian-Pashto | 59.9 | 56.5 | 56.7 | 53.3 |
| Croatian-Afrikaans | 0.8 | 1.1 | 0.9 | 0.9 |
| Croatian-Asturian | 5.6 | 4.2 | 3.1 | 6.1 |
| Croatian-Urdu | 13.2 | 16.0 | 16.1 | 11.8 |
| Croatian-Pashto | 50.3 | 54.7 | 54.5 | 48.7 |
| Urdu-Afrikaans | 4.4 | 4.0 | 4.0 | 2.1 |
| Urdu-Asturian | 24.1 | 11.8 | 7.9 | 9.0 |
| Urdu-Croatian | 7.9 | 8.5 | 8.8 | 5.7 |
| Urdu-Pashto | 53.4 | 59.3 | 60.3 | 55.2 |
| Pashto-Afrikaans | 6.7 | 8.3 | 7.3 | 4.2 |
| Pashto-Asturian | 30.7 | 17.3 | 13.6 | 13.2 |
| Pashto-Croatian | 16.4 | 15.4 | 14.9 | 12.4 |
| Pashto-Urdu | 22.7 | 30.5 | 31.1 | 25.9 |
| Average | 19.5 | 19.0 | 18.5 | 16.6 |

Table 6: The percentage of hallucinations (chrF < 20; lower is better) of different pivoting methods for the SMaLL100 model on selected language pairs of the FLORES-101 benchmark Goyal et al. (2022).

Appendix B Results for Oscillatory Hallucinations Based on the TNG Metric

| Language Pairs | Direct | MultiAvg | MaxEns | EN Pivot |
|---|---|---|---|---|
| Afrikaans-Asturian | 2.2 | 2.1 | 0.7 | 1.3 |
| Afrikaans-Croatian | 0.3 | 0.3 | 0.2 | 0.2 |
| Afrikaans-Urdu | 1.2 | 1.5 | 1.1 | 0.6 |
| Afrikaans-Pashto | 6.7 | 6.7 | 5.1 | 4.7 |
| Asturian-Afrikaans | 5.1 | 3.0 | 3.9 | 0.2 |
| Asturian-Croatian | 11.9 | 1.5 | 3.0 | 0.4 |
| Asturian-Urdu | 15.7 | 4.1 | 4.3 | 0.7 |
| Asturian-Pashto | 20.9 | 12.8 | 11.6 | 5.0 |
| Croatian-Afrikaans | 0.1 | 0.2 | 0.2 | 0.0 |
| Croatian-Asturian | 2.2 | 2.0 | 1.0 | 1.2 |
| Croatian-Urdu | 1.5 | 1.3 | 1.1 | 0.5 |
| Croatian-Pashto | 7.2 | 6.4 | 4.9 | 4.3 |
| Urdu-Afrikaans | 8.2 | 3.3 | 3.3 | 0.9 |
| Urdu-Asturian | 15.8 | 6.2 | 4.0 | 2.4 |
| Urdu-Croatian | 2.5 | 2.6 | 2.8 | 0.8 |
| Urdu-Pashto | 9.3 | 11.7 | 9.5 | 4.2 |
| Pashto-Afrikaans | 2.7 | 3.6 | 3.5 | 0.8 |
| Pashto-Asturian | 16.0 | 6.2 | 4.1 | 2.0 |
| Pashto-Croatian | 3.6 | 2.9 | 3.0 | 1.0 |
| Pashto-Urdu | 2.5 | 4.8 | 4.3 | 1.6 |
| Average | 7.3 | 4.5 | 4.0 | 1.9 |

Table 7: Percentage of hallucinations (TNG metric; lower is better) of different pivoting methods, averaged over the M2M100 and SMaLL100 models, on selected language pairs of the FLORES-101 benchmark.
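
As a sanity check on how this table relates to the per-model results, the sketch below compares the Average row of Table 7 with the mean of the Average rows of Tables 8 and 9. The tolerance of 0.1 is an assumption that allows for rounding of the one-decimal inputs and of the reported mean; the variable and function names are illustrative only.

```python
# Average rows copied from the tables (Direct, MultiAvg, MaxEns, EN Pivot)
m2m100_avg = [10.9, 7.6, 6.7, 2.8]     # Table 8
small100_avg = [3.6, 1.5, 1.2, 0.9]    # Table 9
reported_avg = [7.3, 4.5, 4.0, 1.9]    # Table 7


def consistent_with_mean(a, b, reported, tol=0.1):
    """Check that each reported value matches the mean of a and b within tol.

    Assumption: tol covers rounding of the inputs and of the reported mean.
    """
    means = [(x + y) / 2 for x, y in zip(a, b)]
    return all(abs(m - r) <= tol + 1e-9 for m, r in zip(means, reported))


print(consistent_with_mean(m2m100_avg, small100_avg, reported_avg))
```

Small deviations from the exact mean are expected here, since the underlying unrounded percentages are not given in the tables.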

| Language Pairs | Direct | MultiAvg | MaxEns | EN Pivot |
|---|---|---|---|---|
| Afrikaans-Asturian | 1.8 | 2.6 | 1.1 | 0.5 |
| Afrikaans-Croatian | 0.5 | 0.49 | 0.2 | 0.3 |
| Afrikaans-Urdu | 1.9 | 2.7 | 1.6 | 0.9 |
| Afrikaans-Pashto | 10.4 | 11.6 | 9.1 | 8.0 |
| Asturian-Afrikaans | 9.8 | 5.6 | 6.8 | 0.2 |
| Asturian-Croatian | 22.6 | 2.5 | 5.2 | 0.4 |
| Asturian-Urdu | 25.3 | 7.6 | 7.7 | 1.2 |
| Asturian-Pashto | 36.8 | 23.6 | 21.3 | 9.2 |
| Croatian-Afrikaans | 0.1 | 0.2 | 0.2 | 0.0 |
| Croatian-Asturian | 2.3 | 2.7 | 1.1 | 0.2 |
| Croatian-Urdu | 2.2 | 2.1 | 1.7 | 0.7 |
| Croatian-Pashto | 11.8 | 11.1 | 8.5 | 7.5 |
| Urdu-Afrikaans | 15.2 | 6.1 | 6.1 | 1.5 |
| Urdu-Asturian | 18.2 | 9.0 | 6.8 | 2.1 |
| Urdu-Croatian | 4.1 | 4.8 | 4.9 | 1.2 |
| Urdu-Pashto | 13.3 | 21.0 | 16.8 | 7.4 |
| Pashto-Afrikaans | 4.2 | 4.3 | 4.7 | 1.1 |
| Pashto-Asturian | 12.7 | 6.9 | 5.1 | 1.2 |
| Pashto-Croatian | 4.2 | 3.7 | 3.7 | 1.6 |
| Pashto-Urdu | 3.3 | 7.5 | 6.2 | 2.8 |
| Average | 10.9 | 7.6 | 6.7 | 2.8 |

Table 8: The percentage of hallucinations (TNG metric; lower is better) of different pivoting methods for the M2M100 model on selected language pairs of the FLORES-101 benchmark Goyal et al. (2022).

| Language Pairs | Direct | MultiAvg | MaxEns | EN Pivot |
|---|---|---|---|---|
| Afrikaans-Asturian | 2.5 | 1.6 | 0.2 | 2.1 |
| Afrikaans-Croatian | 0.0 | 0.1 | 0.2 | 0.1 |
| Afrikaans-Urdu | 0.4 | 0.3 | 0.6 | 0.2 |
| Afrikaans-Pashto | 2.9 | 1.8 | 1.2 | 1.3 |
| Asturian-Afrikaans | 0.4 | 0.5 | 0.9 | 0.2 |
| Asturian-Croatian | 1.2 | 0.4 | 0.7 | 0.4 |
| Asturian-Urdu | 6.1 | 0.7 | 0.9 | 0.2 |
| Asturian-Pashto | 5.0 | 2.0 | 1.9 | 0.9 |
| Croatian-Afrikaans | 0.0 | 0.1 | 0.1 | 0.0 |
| Croatian-Asturian | 2.1 | 1.3 | 0.8 | 2.2 |
| Croatian-Urdu | 0.8 | 0.4 | 0.4 | 0.3 |
| Croatian-Pashto | 2.6 | 1.7 | 1.3 | 1.2 |
| Urdu-Afrikaans | 1.1 | 0.6 | 0.4 | 0.3 |
| Urdu-Asturian | 13.4 | 3.3 | 1.2 | 2.6 |
| Urdu-Croatian | 0.8 | 0.4 | 0.7 | 0.4 |
| Urdu-Pashto | 5.3 | 2.3 | 2.1 | 1.0 |
| Pashto-Afrikaans | 1.2 | 2.9 | 2.2 | 0.4 |
| Pashto-Asturian | 19.4 | 5.4 | 3.1 | 2.7 |
| Pashto-Croatian | 3.0 | 2.1 | 2.3 | 0.3 |
| Pashto-Urdu | 1.6 | 2.2 | 2.4 | 0.4 |
| Average | 3.6 | 1.5 | 1.2 | 0.9 |

Table 9: The percentage of hallucinations (TNG metric; lower is better) of different pivoting methods for the SMaLL100 model on selected language pairs of the FLORES-101 benchmark Goyal et al. (2022).
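
For readers unfamiliar with top n-gram (TNG) style detection of oscillatory hallucinations, the following is a rough sketch of the general idea: a translation is flagged when its most frequent n-gram repeats markedly more often than the most frequent n-gram of the source. The whitespace tokenization, the n-gram order `n=2`, and the gap `min_gap=4` are assumptions chosen for illustration and may differ from the settings used for the tables above.

```python
from collections import Counter


def top_ngram_count(tokens, n=2):
    """Count of the most frequent n-gram in a token list (0 if too short)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return max(Counter(ngrams).values()) if ngrams else 0


def is_oscillatory(source, translation, n=2, min_gap=4):
    """Flag a translation whose top n-gram repeats far more than the source's.

    Assumptions: whitespace tokenization; n and min_gap are illustrative values.
    """
    src_count = top_ngram_count(source.split(), n)
    hyp_count = top_ngram_count(translation.split(), n)
    return hyp_count - src_count >= min_gap


def oscillation_rate(pairs, n=2, min_gap=4):
    """Percentage of (source, translation) pairs flagged as oscillatory."""
    flagged = sum(is_oscillatory(s, t, n, min_gap) for s, t in pairs)
    return 100.0 * flagged / len(pairs)


# Hypothetical examples (not taken from any system output)
pairs = [
    ("the cat sat on the mat", "el gato el gato el gato el gato el gato"),
    ("the cat sat on the mat", "el gato se sentó en la alfombra"),
]
print(oscillation_rate(pairs))
```

Comparing against the source count rather than using an absolute threshold avoids flagging translations of inputs that are themselves repetitive.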