-
M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation
Authors:
Benjamin Hsu,
Xiaoyu Liu,
Huayang Li,
Yoshinari Fujinuma,
Maria Nadejde,
Xing Niu,
Yair Kittenplon,
Ron Litman,
Raghavendra Pappagari
Abstract:
Document translation poses a challenge for Neural Machine Translation (NMT) systems. Most document-level NMT systems rely on meticulously curated sentence-level parallel data, assuming flawless extraction of text from documents along with their precise reading order. These systems also tend to disregard additional visual cues such as the document layout, deeming it irrelevant. However, real-world documents often possess intricate text layouts that defy these assumptions. Extracting information from Optical Character Recognition (OCR) or heuristic rules can result in errors, and the layout (e.g., paragraphs, headers) may convey relationships between distant sections of text. This complexity is particularly evident in widely used PDF documents, which represent information visually. This paper addresses this gap by introducing M3T, a novel benchmark dataset tailored to evaluate NMT systems on the comprehensive task of translating semi-structured documents. This dataset aims to bridge the evaluation gap in document-level NMT systems, acknowledging the challenges posed by rich text layouts in real-world applications.
Submitted 12 June, 2024;
originally announced June 2024.
-
RAMP: Retrieval and Attribute-Marking Enhanced Prompting for Attribute-Controlled Translation
Authors:
Gabriele Sarti,
Phu Mon Htut,
Xing Niu,
Benjamin Hsu,
Anna Currey,
Georgiana Dinu,
Maria Nadejde
Abstract:
Attribute-controlled translation (ACT) is a subtask of machine translation that involves controlling stylistic or linguistic attributes (like formality and gender) of translation outputs. While ACT has garnered attention in recent years due to its usefulness in real-world applications, progress in the task is currently limited by dataset availability, since most prior approaches rely on supervised methods. To address this limitation, we propose Retrieval and Attribute-Marking enhanced Prompting (RAMP), which leverages large multilingual language models to perform ACT in few-shot and zero-shot settings. RAMP improves generation accuracy over the standard prompting approach by (1) incorporating a semantic similarity retrieval component for selecting similar in-context examples, and (2) marking in-context examples with attribute annotations. Our comprehensive experiments show that RAMP is a viable approach in both zero-shot and few-shot settings.
Submitted 26 May, 2023;
originally announced May 2023.
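
The two components named in the RAMP abstract (similarity-based retrieval of in-context examples, and attribute marking of those examples) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the bag-of-words cosine stands in for a real multilingual sentence encoder, the prompt template and the `(formal)` / `(informal)` label format are hypothetical, and the example pool is toy data. It is not the authors' implementation.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Toy similarity: cosine over lowercase word counts.

    Assumption: a stand-in for a semantic sentence encoder.
    """
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(source: str, attribute: str, pool: list, k: int = 2) -> str:
    """Build a few-shot prompt (hypothetical template)."""
    # (1) Retrieval: keep the k pool examples most similar to the input.
    ranked = sorted(pool, key=lambda ex: bow_cosine(source, ex["src"]), reverse=True)[:k]
    lines = []
    for ex in ranked:
        # (2) Attribute marking: label each in-context example with its attribute.
        lines.append(f"English: {ex['src']}")
        lines.append(f"German ({ex['attr']}): {ex['tgt']}")
    # The query carries the requested attribute and leaves the translation open.
    lines.append(f"English: {source}")
    lines.append(f"German ({attribute}):")
    return "\n".join(lines)


# Toy example pool (illustrative data only).
pool = [
    {"src": "Are you sure?", "attr": "formal", "tgt": "Sind Sie sich sicher?"},
    {"src": "Are you sure?", "attr": "informal", "tgt": "Bist du dir sicher?"},
    {"src": "The weather is nice today.", "attr": "formal", "tgt": "Das Wetter ist heute schön."},
]

prompt = build_prompt("Are you sure about that?", "formal", pool, k=2)
print(prompt)
```

The resulting string would then be passed to a multilingual language model for completion; that step is omitted here because the abstract does not specify the model or decoding setup.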
-
Pseudo-Label Training and Model Inertia in Neural Machine Translation
Authors:
Benjamin Hsu,
Anna Currey,
Xing Niu,
Maria Nădejde,
Georgiana Dinu
Abstract:
Like many other machine learning applications, neural machine translation (NMT) benefits from over-parameterized deep neural models. However, these models have been observed to be brittle: NMT model predictions are sensitive to small input changes and can show significant variation across re-training or incremental model updates. This work studies a frequently used method in NMT, pseudo-label training (PLT), which is common to the related techniques of forward-translation (or self-training) and sequence-level knowledge distillation. While the effect of PLT on quality is well-documented, we highlight a lesser-known effect: PLT can enhance a model's stability to model updates and input perturbations, a set of properties we call model inertia. We study inertia effects under different training settings and we identify distribution simplification as a mechanism behind the observed results.
Submitted 19 May, 2023;
originally announced May 2023.
-
MT-GenEval: A Counterfactual and Contextual Dataset for Evaluating Gender Accuracy in Machine Translation
Authors:
Anna Currey,
Maria Nădejde,
Raghavendra Pappagari,
Mia Mayer,
Stanislas Lauly,
Xing Niu,
Benjamin Hsu,
Georgiana Dinu
Abstract:
As generic machine translation (MT) quality has improved, the need for targeted benchmarks that explore fine-grained aspects of quality has increased. In particular, gender accuracy in translation can have implications in terms of output fluency, translation accuracy, and ethics. In this paper, we introduce MT-GenEval, a benchmark for evaluating gender accuracy in translation from English into eight widely spoken languages. MT-GenEval complements existing benchmarks by providing realistic, gender-balanced, counterfactual data in eight language pairs where the gender of individuals is unambiguous in the input segment, including multi-sentence segments requiring inter-sentential gender agreement. Our data and code are publicly available under a CC BY SA 3.0 license.
Submitted 2 November, 2022;
originally announced November 2022.
-
A baseline revisited: Pushing the limits of multi-segment models for context-aware translation
Authors:
Suvodeep Majumder,
Stanislas Lauly,
Maria Nadejde,
Marcello Federico,
Georgiana Dinu
Abstract:
This paper addresses the task of contextual translation using multi-segment models. Specifically, we show that increasing model capacity further pushes the limits of this approach and that deeper models are better suited to capturing context dependencies. Furthermore, improvements observed with larger models can be transferred to smaller models using knowledge distillation. Our experiments show that this approach achieves competitive performance across several languages and benchmarks, without additional language-specific tuning or task-specific architectures.
Submitted 21 October, 2022; v1 submitted 19 October, 2022;
originally announced October 2022.
-
Sockeye 3: Fast Neural Machine Translation with PyTorch
Authors:
Felix Hieber,
Michael Denkowski,
Tobias Domhan,
Barbara Darques Barros,
Celina Dong Ye,
Xing Niu,
Cuong Hoang,
Ke Tran,
Benjamin Hsu,
Maria Nadejde,
Surafel Lakew,
Prashant Mathur,
Anna Currey,
Marcello Federico
Abstract:
Sockeye 3 is the latest version of the Sockeye toolkit for Neural Machine Translation (NMT). Now based on PyTorch, Sockeye 3 provides faster model implementations and more advanced features with a further streamlined codebase. This enables broader experimentation with faster iteration, efficient training of stronger and faster models, and the flexibility to move new ideas quickly from research to production. When running comparable models, Sockeye 3 is up to 126% faster than other PyTorch implementations on GPUs and up to 292% faster on CPUs. Sockeye 3 is open source software released under the Apache 2.0 license.
Submitted 2 August, 2022; v1 submitted 12 July, 2022;
originally announced July 2022.
-
CoCoA-MT: A Dataset and Benchmark for Contrastive Controlled MT with Application to Formality
Authors:
Maria Nădejde,
Anna Currey,
Benjamin Hsu,
Xing Niu,
Marcello Federico,
Georgiana Dinu
Abstract:
The machine translation (MT) task is typically formulated as that of returning a single translation for an input segment. However, in many cases, multiple different translations are valid and the appropriate translation may depend on the intended target audience, characteristics of the speaker, or even the relationship between speakers. Specific problems arise when dealing with honorifics, particularly translating from English into languages with formality markers. For example, the sentence "Are you sure?" can be translated in German as "Sind Sie sich sicher?" (formal register) or "Bist du dir sicher?" (informal). Using wrong or inconsistent tone may be perceived as inappropriate or jarring for users of certain cultures and demographics. This work addresses the problem of learning to control target language attributes, in this case formality, from a small amount of labeled contrastive data. We introduce an annotated dataset (CoCoA-MT) and an associated evaluation metric for training and evaluating formality-controlled MT models for six diverse target languages. We show that we can train formality-controlled models by fine-tuning on labeled contrastive data, achieving high accuracy (82% in-domain and 73% out-of-domain) while maintaining overall quality.
Submitted 9 May, 2022;
originally announced May 2022.
-
Personalizing Grammatical Error Correction: Adaptation to Proficiency Level and L1
Authors:
Maria Nadejde,
Joel Tetreault
Abstract:
Grammar error correction (GEC) systems have become ubiquitous in a variety of software applications, and have started to approach human-level performance for some datasets. However, very little is known about how to efficiently personalize these systems to the user's characteristics, such as their proficiency level and first language, or to emerging domains of text. We present the first results on adapting a general-purpose neural GEC system to both the proficiency level and the first language of a writer, using only a few thousand annotated sentences. Our study is the broadest of its kind, covering five proficiency levels and twelve different languages, and comparing three different adaptation scenarios: adapting to the proficiency level only, to the first language only, or to both aspects simultaneously. We show that tailoring to both scenarios achieves the largest performance improvement (3.6 F0.5) relative to a strong baseline.
Submitted 4 June, 2020;
originally announced June 2020.
-
Nematus: a Toolkit for Neural Machine Translation
Authors:
Rico Sennrich,
Orhan Firat,
Kyunghyun Cho,
Alexandra Birch,
Barry Haddow,
Julian Hitschler,
Marcin Junczys-Dowmunt,
Samuel Läubli,
Antonio Valerio Miceli Barone,
Jozef Mokry,
Maria Nădejde
Abstract:
We present Nematus, a toolkit for Neural Machine Translation. The toolkit prioritizes high translation accuracy, usability, and extensibility. Nematus has been used to build top-performing submissions to shared translation tasks at WMT and IWSLT, and has been used to train systems for production environments.
Submitted 13 March, 2017;
originally announced March 2017.
-
Predicting Target Language CCG Supertags Improves Neural Machine Translation
Authors:
Maria Nadejde,
Siva Reddy,
Rico Sennrich,
Tomasz Dwojak,
Marcin Junczys-Dowmunt,
Philipp Koehn,
Alexandra Birch
Abstract:
Neural machine translation (NMT) models are able to partially learn syntactic information from sequential lexical information. Still, some complex syntactic phenomena such as prepositional phrase attachment are poorly modeled. This work aims to answer two questions: 1) Does explicitly modeling target language syntax help NMT? 2) Is tight integration of words and syntax better than multitask training? We introduce syntactic information in the form of CCG supertags in the decoder, by interleaving the target supertags with the word sequence. Our results on WMT data show that explicitly modeling target syntax improves machine translation quality for German→English, a high-resource pair, and for Romanian→English, a low-resource pair, and also improves the handling of several syntactic phenomena, including prepositional phrase attachment. Furthermore, a tight coupling of words and syntax improves translation quality more than multitask training. By combining target syntax with source-side dependency labels added in the embedding layer, we obtain a total improvement of 0.9 BLEU for German→English and 1.2 BLEU for Romanian→English.
Submitted 18 July, 2017; v1 submitted 3 February, 2017;
originally announced February 2017.
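
The interleaving of target supertags with the word sequence described in the abstract above can be sketched as a simple preprocessing step. This is a minimal illustration under stated assumptions: placing each tag before its word, the helper names `interleave` and `strip_tags`, and the three-word example are all hypothetical, and subword segmentation is ignored.

```python
def interleave(words: list, supertags: list) -> list:
    """Build a decoder target sequence that alternates supertag and word.

    Assumption: each tag is emitted immediately before the word it describes.
    """
    if len(words) != len(supertags):
        raise ValueError("need exactly one supertag per word")
    target = []
    for word, tag in zip(words, supertags):
        target.extend([tag, word])
    return target


def strip_tags(target: list) -> list:
    """Recover the plain word sequence from an interleaved sequence."""
    # Words sit at the odd positions under the tag-before-word convention.
    return target[1::2]


words = ["Mary", "likes", "John"]
tags = ["NP", "(S\\NP)/NP", "NP"]  # standard CCG categories for a transitive clause

mixed = interleave(words, tags)
print(mixed)              # alternating tag/word sequence used as the training target
print(strip_tags(mixed))  # plain words, as needed before scoring the translation
```

Under this scheme a single decoder predicts both syntax and words in one sequence, which is what distinguishes it from the multitask setup where tags and words are produced by separate output layers.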