Investigating the translation capabilities of Large Language Models trained on parallel data only

Authors: Javier García Gilabert, Carlos Escolano, Aleix Sant Savall, Francesca De Luca Fornaciari, Audrey Mash, Xixian Liao, Maite Melero

Abstract: In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methods predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce PLUME (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the impact of the different elements of the prompt, and their cross-lingual representation space.

Submitted 13 June, 2024; originally announced June 2024.

Comments: We release our code at: https://github.com/projecte-aina/Plume

arXiv:2309.17134 [pdf, other]

Promoting Generalized Cross-lingual Question Answering in Few-resource Scenarios via Self-knowledge Distillation

Authors: Casimiro Pio Carrino, Carlos Escolano, José A. R. Fonollosa

Abstract: Despite substantial progress in multilingual extractive Question Answering (QA), models with high and uniformly distributed performance across languages remain challenging, especially for languages with limited resources. We study cross-lingual transfer mainly focusing on the Generalized Cross-Lingual Transfer (G-XLT) task, where the question language differs from the context language - a challenge that has received limited attention thus far. Our approach seeks to enhance cross-lingual QA transfer using a high-performing multilingual model trained on a large-scale dataset, complemented by a few thousand aligned QA examples across languages. Our proposed strategy combines cross-lingual sampling and advanced self-distillation training in generations to tackle the previous challenge. Notably, we introduce the novel mAP@k coefficients to fine-tune self-knowledge distillation loss, dynamically regulating the teacher's model knowledge to perform a balanced and effective knowledge transfer. We extensively evaluate our approach to assess XLT and G-XLT capabilities in extractive QA. Results reveal that our self-knowledge distillation approach outperforms standard cross-entropy fine-tuning by a significant margin. Importantly, when compared to a strong baseline that leverages a sizeable volume of machine-translated data, our approach shows competitive results despite the considerable challenge of operating within resource-constrained settings, even in zero-shot scenarios.
Beyond performance improvements, we offer valuable insights through comprehensive analyses and an ablation study, further substantiating the benefits and constraints of our approach. In essence, we propose a practical solution to improve cross-lingual QA transfer by leveraging a few data resources in an efficient way.

Submitted 29 September, 2023; originally announced September 2023.

Comments: Submitted to the Journal of Artificial Intelligence Research (JAIR)

arXiv:2305.11761 [pdf, other]

ReSeTOX: Re-learning attention weights for toxicity mitigation in machine translation

Authors: Javier García Gilabert, Carlos Escolano, Marta R. Costa-Jussà

Abstract: Our proposed method, ReSeTOX (REdo SEarch if TOXic), addresses the issue of Neural Machine Translation (NMT) generating translation outputs that contain toxic words not present in the input. The objective is to mitigate the introduction of toxic language without the need for re-training. In the case of identified added toxicity during the inference process, ReSeTOX dynamically adjusts the key-value self-attention weights and re-evaluates the beam search hypotheses. Experimental results demonstrate that ReSeTOX achieves a remarkable 57% reduction in added toxicity while maintaining an average translation quality of 99.5% across 164 languages.

Submitted 19 May, 2023; originally announced May 2023.

arXiv:2210.03070 [pdf, other]

Toxicity in Multilingual Machine Translation at Scale

Authors: Marta R. Costa-jussà, Eric Smith, Christophe Ropers, Daniel Licht, Jean Maillard, Javier Ferrando, Carlos Escolano

Abstract: Machine Translation systems can produce different types of errors, some of which are characterized as critical or catastrophic due to the specific negative impact that they can have on users. In this paper we focus on one type of critical error: added toxicity. We evaluate and analyze added toxicity when translating a large evaluation dataset (HOLISTICBIAS, over 472k sentences, covering 13 demographic axes) from English into 164 languages. An automatic toxicity evaluation shows that added toxicity across languages varies from 0% to 5%. The output languages with the most added toxicity tend to be low-resource ones, and the demographic axes with the most added toxicity include sexual orientation, gender and sex, and ability. We also perform human evaluation on a subset of 8 translation directions, confirming the prevalence of true added toxicity. We use a measurement of the amount of source contribution to the translation, where a low source contribution implies hallucination, to interpret what causes toxicity. Making use of the input attributions allows us to explain toxicity, because the source contributions significantly correlate with toxicity for 84% of languages studied. Given our findings, our recommendations to reduce added toxicity are to curate training data to avoid mistranslations, mitigate hallucination and check unstable translations.

Submitted 5 April, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

ACM Class: I.2.7

arXiv:2205.11631 [pdf, other]

Towards Opening the Black Box of Neural Machine Translation: Source and Target Interpretations of the Transformer

Authors: Javier Ferrando, Gerard I. Gállego, Belen Alastruey, Carlos Escolano, Marta R. Costa-jussà

Abstract: In Neural Machine Translation (NMT), each token prediction is conditioned on the source sentence and the target prefix (what has been previously translated at a decoding step). However, previous work on interpretability in NMT has focused mainly on source sentence tokens' attributions. Therefore, we lack a full understanding of the influences of every input token (source sentence and target prefix) in the model predictions. In this work, we propose an interpretability method that tracks input tokens' attributions for both contexts. Our method, which can be extended to any encoder-decoder Transformer-based model, allows us to better comprehend the inner workings of current NMT models. We apply the proposed method to both bilingual and multilingual Transformers and present insights into their behaviour.

Submitted 4 November, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

Comments: EMNLP 2022

arXiv:2202.06041 [pdf, other]

A multi-task semi-supervised framework for Text2Graph & Graph2Text

Authors: Oriol Domingo, Marta R. Costa-jussà, Carlos Escolano

Abstract: The Artificial Intelligence industry regularly develops applications that mostly rely on Knowledge Bases, data repositories about specific or general domains, usually represented as graphs. Similar to other databases, they face two main challenges: information ingestion and information retrieval. We approach these challenges by jointly learning graph extraction from text and text generation from graphs. The proposed solution, a T5 architecture, is trained in a multi-task semi-supervised environment, with our collected non-parallel data, following a cycle training regime. Experiments on the WebNLG dataset show that our approach surpasses unsupervised state-of-the-art results in text-to-graph and graph-to-text. More relevantly, our framework is more consistent across seen and unseen domains than supervised models. The resulting model can be easily trained in any new domain with non-parallel data, by simply adding text and graphs about it, in our cycle framework.

Submitted 18 February, 2022; v1 submitted 12 February, 2022; originally announced February 2022.

Comments: 5 pages, 2 figures, 3 tables and 8 equations

arXiv:2105.04512 [pdf, other]

End-to-End Speech Translation with Pre-trained Models and Adapters: UPC at IWSLT 2021

Authors: Gerard I. Gállego, Ioannis Tsiamas, Carlos Escolano, José A. R. Fonollosa, Marta R. Costa-jussà

Abstract: This paper describes the submission to the IWSLT 2021 offline speech translation task by the UPC Machine Translation group. The task consists of building a system capable of translating English audio recordings extracted from TED talks into German text. Submitted systems can be either cascade or end-to-end and use a custom or given segmentation. Our submission is an end-to-end speech translation system, which combines pre-trained models (Wav2Vec 2.0 and mBART) with coupling modules between the encoder and decoder, and uses an efficient fine-tuning technique, which trains only 20% of its total parameters. We show that adding an Adapter to the system and pre-training it can increase the convergence speed and the final result, with which we achieve a BLEU score of 27.3 on the MuST-C test set. Our final model is an ensemble that obtains a BLEU score of 28.22 on the same set. Our submission also uses a custom segmentation algorithm that employs pre-trained Wav2Vec 2.0 for identifying periods of untranscribable text and can bring improvements of 2.5 to 3 BLEU points on the IWSLT 2019 test set, as compared to the result with the given segmentation.

Submitted 28 June, 2021; v1 submitted 10 May, 2021; originally announced May 2021.

Comments: Submitted to IWSLT 2021; changed the title and added submission results

arXiv:2012.13176 [pdf, other]

Gender Bias in Multilingual Neural Machine Translation: The Architecture Matters

Authors: Marta R. Costa-jussà, Carlos Escolano, Christine Basta, Javier Ferrando, Roser Batlle, Ksenia Kharitonova

Abstract: Multilingual Neural Machine Translation architectures mainly differ in the amount of sharing modules and parameters among languages. In this paper, and from an algorithmic perspective, we explore if the chosen architecture, when trained with the same data, influences the gender bias accuracy. Experiments in four language pairs show that Language-Specific encoders-decoders exhibit less bias than the Shared encoder-decoder architecture. Further interpretability analysis of source embeddings and the attention shows that, in the Language-Specific case, the embeddings encode more gender information, and its attention is more diverted. Both behaviors help in mitigating gender bias.

Submitted 24 December, 2020; originally announced December 2020.

Comments: 12 pages, 5 figures, 3 tables

ACM Class: I.2.7

arXiv:2011.01097 [pdf, other]

Enabling Zero-shot Multilingual Spoken Language Translation with Language-Specific Encoders and Decoders

Authors: Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa, Carlos Segura

Abstract: Current end-to-end approaches to Spoken Language Translation (SLT) rely on limited training resources, especially for multilingual settings. On the other hand, Multilingual Neural Machine Translation (MultiNMT) approaches rely on higher-quality and more massive data sets. Our proposed method extends a MultiNMT architecture based on language-specific encoders-decoders to the task of Multilingual SLT (MultiSLT). Our method entirely eliminates the dependency on MultiSLT data and is able to translate while training only on ASR and MultiNMT data. Our experiments on four different languages show that coupling the speech encoder to the MultiNMT architecture produces similar quality translations compared to a bilingual baseline ($\pm 0.2$ BLEU) while effectively allowing for zero-shot MultiSLT. Additionally, we propose using an Adapter module for coupling the speech inputs. This Adapter module produces consistent improvements up to +6 BLEU points on the proposed architecture and +1 BLEU point on the end-to-end baseline.

Submitted 15 September, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

ACM Class: I.2.7

Journal ref: IEEE Workshop on Automatic Speech Recognition and Understanding 2021

arXiv:2006.01594 [pdf, other]

Training Multilingual Machine Translation by Alternately Freezing Language-Specific Encoders-Decoders

Authors: Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa, Mikel Artetxe

Abstract: We propose a modular architecture of language-specific encoder-decoders that constitutes a multilingual machine translation system that can be incrementally extended to new languages without the need for retraining the existing system when adding new languages. Differently from previous works, we simultaneously train $N$ languages in all translation directions by alternately freezing encoder or decoder modules, which indirectly forces the system to train in a common intermediate representation for all languages. Experimental results from multilingual machine translation show that we can successfully train this modular architecture improving on the initial languages while falling slightly behind when adding new languages or doing zero-shot translation. Additional comparison of the quality of sentence representation in the task of natural language inference shows that the alternately freezing training is also beneficial in this direction.

Submitted 29 May, 2020; originally announced June 2020.

Comments: arXiv admin note: text overlap with arXiv:2004.06575

ACM Class: I.2.7

arXiv:2004.08053 [pdf, other]

Enriching the Transformer with Linguistic Factors for Low-Resource Machine Translation

Authors: Jordi Armengol-Estapé, Marta R. Costa-jussà, Carlos Escolano

Abstract: Introducing factors, that is to say, word features such as linguistic information referring to the source tokens, is known to improve the results of neural machine translation systems in certain settings, typically in recurrent architectures. This study proposes enhancing the current state-of-the-art neural machine translation architecture, the Transformer, so that it allows the introduction of external knowledge. In particular, our proposed modification, the Factored Transformer, uses linguistic factors that insert additional knowledge into the machine translation system. Apart from using different kinds of features, we study the effect of different architectural configurations. Specifically, we analyze the performance of combining words and features at the embedding level or at the encoder level, and we experiment with two different combination strategies. With the best-found configuration, we show improvements of 0.8 BLEU over the baseline Transformer in the IWSLT German-to-English task. Moreover, we experiment with the more challenging FLoRes English-to-Nepali benchmark, which includes both extremely low-resourced and very distant languages, and obtain an improvement of 1.2 BLEU.

Submitted 24 December, 2020; v1 submitted 16 April, 2020; originally announced April 2020.

ACM Class: I.2.7

arXiv:2004.06575 [pdf, ps, other]

Multilingual Machine Translation: Closing the Gap between Shared and Language-specific Encoder-Decoders

Authors: Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa, Mikel Artetxe

Abstract: State-of-the-art multilingual machine translation relies on a universal encoder-decoder, which requires retraining the entire system to add new languages. In this paper, we propose an alternative approach that is based on language-specific encoder-decoders, and can thus be more easily extended to new languages by learning their corresponding modules. So as to encourage a common interlingua representation, we simultaneously train the N initial languages. Our experiments show that the proposed approach outperforms the universal encoder-decoder by 3.28 BLEU points on average, and when adding new languages, without the need to retrain the rest of the modules. All in all, our work closes the gap between shared and language-specific encoder-decoders, advancing toward modular multilingual machine translation systems that can be flexibly extended in lifelong learning settings.

Submitted 14 April, 2020; originally announced April 2020.

ACM Class: I.2.7

arXiv:1907.00810 [pdf, other]

Multilingual, Multi-scale and Multi-layer Visualization of Intermediate Representations

Authors: Carlos Escolano, Marta R. Costa-jussà, Elora Lacroux, Pere-Pau Vázquez

Abstract: The main alternatives nowadays to deal with sequences are Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN) architectures and the Transformer. In this context, RNNs, CNNs and the Transformer have most commonly been used as an encoder-decoder architecture with multiple layers in each module. Far beyond this, these architectures are the basis for the contextual word embeddings which are revolutionizing most natural language downstream applications. However, intermediate layer representations in sequence-based architectures can be difficult to interpret. To make each layer representation within these architectures more accessible and meaningful, we introduce a web-based tool that visualizes them both at the sentence and token level. We present three use cases. The first analyses gender issues in contextual word embeddings. The second and third show multilingual intermediate representations for sentences and tokens and the evolution of these intermediate representations along the multiple layers of the decoder and in the context of multilingual machine translation.

Submitted 1 July, 2019; originally announced July 2019.

arXiv:1907.00735 [pdf, other]

From Bilingual to Multilingual Neural Machine Translation by Incremental Training

Authors: Carlos Escolano, Marta R. Costa-Jussà, José A. R. Fonollosa

Abstract: Multilingual Neural Machine Translation approaches are based on the use of task-specific models and the addition of one more language can only be done by retraining the whole system. In this work, we propose a new training schedule that allows the system to scale to more languages without modification of the previous components based on joint training and language-independent encoder/decoder modules allowing for zero-shot translation. This work in progress shows results close to the state of the art in the WMT task.

Submitted 11 July, 2019; v1 submitted 28 June, 2019; originally announced July 2019.

Comments: Accepted paper at ACL 2019 Student Research Workshop. arXiv admin note: substantial text overlap with arXiv:1905.06831

arXiv:1905.06831 [pdf, other]

Towards Interlingua Neural Machine Translation

Authors: Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa

Abstract: Common intermediate language representation in neural machine translation can be used to extend bilingual to multilingual systems by incremental training. In this paper, we propose a new architecture based on introducing an interlingual loss as an additional training objective. By adding and forcing this interlingual loss, we are able to train multiple encoders and decoders for each language, sharing a common intermediate representation. Translation results on the low-resourced tasks (Turkish-English and Kazakh-English tasks, from the popular Workshop on Machine Translation benchmark) show BLEU improvements of up to 2.8. However, results on a larger dataset (Russian-English and Kazakh-English, from the same baselines) show BLEU losses of the same amount. While our system is only providing improvements for the low-resourced tasks in terms of translation quality, our system is capable of quickly deploying new language pairs without retraining the rest of the system, which may be a game-changer in some situations (i.e. in a disaster crisis where international help is required towards a small region or to develop some translation system for a client). Precisely, what is most relevant from our architecture is that it is capable of: (1) reducing the number of production systems, with respect to the number of languages, from quadratic to linear, (2) incrementally adding a new language in the system without retraining languages previously there, and (3) allowing for translations from the new language to all the others present in the system.

Submitted 8 December, 2019; v1 submitted 15 May, 2019; originally announced May 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1810.06351

arXiv:1810.06351 [pdf, other]

(Self-Attentive) Autoencoder-based Universal Language Representation for Machine Translation

Authors: Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa

Abstract: Universal language representation is the holy grail in machine translation (MT). Thanks to the new neural MT approach, it seems that there are good perspectives towards this goal. In this paper, we propose a new architecture based on combining variational autoencoders with encoder-decoders and introducing an interlingual loss as an additional training objective. By adding and forcing this interlingual loss, we are able to train multiple encoders and decoders for each language, sharing a common universal representation. Since the final objective of this universal representation is producing close results for similar input sentences (in any language), we propose to evaluate it by encoding the same sentence in two different languages, decoding both latent representations into the same language and comparing both outputs. Preliminary results on the WMT 2017 Turkish/English task show that the proposed architecture is capable of learning a universal language representation and simultaneously training both translation directions with state-of-the-art results.

Submitted 15 October, 2018; originally announced October 2018.

Comments: 7 pages, 4 figures

arXiv:1610.02209 [pdf, ps, other]

Morphology Generation for Statistical Machine Translation using Deep Learning Techniques

Authors: Marta R. Costa-jussà, Carlos Escolano

Abstract: Morphology in unbalanced languages remains a big challenge in the context of machine translation. In this paper, we propose to de-couple machine translation from morphology generation in order to better deal with the problem. We investigate the morphology simplification with a reasonable trade-off between expected gain and generation complexity. For the Chinese-Spanish task, optimum morphological simplification is in gender and number. For this purpose, we design a new classification architecture which, compared to other standard machine learning techniques, obtains the best results. This proposed neural-based architecture consists of several layers: an embedding layer, a convolutional layer followed by a recurrent neural network and, finally, sigmoid and softmax layers. We obtain classification results over 98% accuracy in gender classification, over 93% in number classification, and an overall translation improvement of 0.7 METEOR.

Submitted 6 February, 2017; v1 submitted 7 October, 2016; originally announced October 2016.

Showing 1–17 of 17 results for author: Escolano, C