-
The Fine-Tuning Paradox: Boosting Translation Quality Without Sacrificing LLM Abilities
Authors:
David Stap,
Eva Hasler,
Bill Byrne,
Christof Monz,
Ke Tran
Abstract:
Fine-tuning large language models (LLMs) for machine translation has shown improvements in overall translation quality. However, it is unclear what is the impact of fine-tuning on desirable LLM behaviors that are not present in neural machine translation models, such as steerability, inherent document-level translation abilities, and the ability to produce less literal translations. We perform an extensive translation evaluation on the LLaMA and Falcon family of models with model size ranging from 7 billion up to 65 billion parameters. Our results show that while fine-tuning improves the general translation quality of LLMs, several abilities degrade. In particular, we observe a decline in the ability to perform formality steering, to produce technical translations through few-shot examples, and to perform document-level translation. On the other hand, we observe that the model produces less literal translations after fine-tuning on parallel data. We show that by including monolingual data as part of the fine-tuning data we can maintain the abilities while simultaneously enhancing overall translation quality. Our findings emphasize the need for fine-tuning strategies that preserve the benefits of LLMs for machine translation.
Submitted 30 May, 2024;
originally announced May 2024.
-
A Preference-driven Paradigm for Enhanced Translation with Large Language Models
Authors:
Dawei Zhu,
Sony Trenous,
Xiaoyu Shen,
Dietrich Klakow,
Bill Byrne,
Eva Hasler
Abstract:
Recent research has shown that large language models (LLMs) can achieve remarkable translation performance through supervised fine-tuning (SFT) using only a small amount of parallel data. However, SFT simply instructs the model to imitate the reference translations at the token level, making it vulnerable to the noise present in the references. Hence, the assistance from SFT often reaches a plateau once the LLMs have achieved a certain level of translation capability, and further increasing the size of parallel data does not provide additional benefits. To overcome this plateau associated with imitation-based SFT, we propose a preference-based approach built upon the Plackett-Luce model. The objective is to steer LLMs towards a more nuanced understanding of translation preferences from a holistic view, while also being more resilient in the absence of gold translations. We further build a dataset named MAPLE to verify the effectiveness of our approach, which includes multiple translations of varying quality for each source sentence. Extensive experiments demonstrate the superiority of our approach in "breaking the plateau" across diverse LLMs and test settings. Our in-depth analysis underscores the pivotal role of diverse translations and accurate preference scores in the success of our approach.
Submitted 17 April, 2024;
originally announced April 2024.
-
Trained MT Metrics Learn to Cope with Machine-translated References
Authors:
Jannis Vamvas,
Tobias Domhan,
Sony Trenous,
Rico Sennrich,
Eva Hasler
Abstract:
Neural metrics trained on human evaluations of MT tend to correlate well with human judgments, but their behavior is not fully understood. In this paper, we perform a controlled experiment and compare a baseline metric that has not been trained on human evaluations (Prism) to a trained version of the same metric (Prism+FT). Surprisingly, we find that Prism+FT becomes more robust to machine-translated references, which are a notorious problem in MT evaluation. This suggests that the effects of metric training go beyond the intended effect of improving overall correlation with human judgments.
Submitted 1 December, 2023;
originally announced December 2023.
-
A tale of two tails: 130 years of growth-at-risk
Authors:
Martin Gächter,
Elias Hasler,
Florian Huber
Abstract:
We extend the existing growth-at-risk (GaR) literature by examining a long time period of 130 years in a time-varying parameter regression model. We identify several important insights for policymakers. First, both the level as well as the determinants of GaR vary significantly over time. Second, the stability of upside risks to GDP growth reported in earlier research is specific to the period known as the Great Moderation, with the distribution of risks being more balanced before the 1970s. Third, the distribution of GDP growth has significantly narrowed since the end of the Bretton Woods system. Fourth, financial stress is always linked to higher downside risks, but it does not affect upside risks. Finally, other risk indicators, such as credit growth and house prices, not only drive downside risks, but also contribute to increased upside risks during boom periods. In this context, the paper also adds to the financial cycle literature by completing the picture of drivers (and risks) for both booms and recessions over time.
Submitted 17 February, 2023;
originally announced February 2023.
-
Analyzing the Use of Influence Functions for Instance-Specific Data Filtering in Neural Machine Translation
Authors:
Tsz Kin Lam,
Eva Hasler,
Felix Hieber
Abstract:
Customer feedback can be an important signal for improving commercial machine translation systems. One solution for fixing specific translation errors is to remove the related erroneous training instances followed by re-training of the machine translation system, which we refer to as instance-specific data filtering. Influence functions (IF) have been shown to be effective in finding such relevant training examples for classification tasks such as image classification, toxic speech detection, and entailment. Given a probing instance, IF find influential training examples by measuring the similarity of the probing instance with a set of training examples in gradient space. In this work, we examine the use of influence functions for Neural Machine Translation (NMT). We propose two effective extensions to a state-of-the-art influence function and demonstrate on the sub-problem of copied training examples that IF can be applied more generally than handcrafted regular expressions.
Submitted 24 October, 2022;
originally announced October 2022.
-
Automatic Evaluation and Analysis of Idioms in Neural Machine Translation
Authors:
Christos Baziotis,
Prashant Mathur,
Eva Hasler
Abstract:
A major open problem in neural machine translation (NMT) is the translation of idiomatic expressions, such as "under the weather". The meaning of these expressions is not composed by the meaning of their constituent words, and NMT models tend to translate them literally (i.e., word-by-word), which leads to confusing and nonsensical translations. Research on idioms in NMT is limited and obstructed by the absence of automatic methods for quantifying these errors. In this work, first, we propose a novel metric for automatically measuring the frequency of literal translation errors without human involvement. Equipped with this metric, we present controlled translation experiments with models trained in different conditions (with/without the test-set idioms) and across a wide range of (global and targeted) metrics and test sets. We explore the role of monolingual pretraining and find that it yields substantial targeted improvements, even without observing any translation examples of the test-set idioms. In our analysis, we probe the role of idiom context. We find that the randomly initialized models are more local or "myopic" as they are relatively unaffected by variations of the idiom context, unlike the pretrained ones.
Submitted 10 October, 2022;
originally announced October 2022.
-
The Devil is in the Details: On the Pitfalls of Vocabulary Selection in Neural Machine Translation
Authors:
Tobias Domhan,
Eva Hasler,
Ke Tran,
Sony Trenous,
Bill Byrne,
Felix Hieber
Abstract:
Vocabulary selection, or lexical shortlisting, is a well-known technique to improve latency of Neural Machine Translation models by constraining the set of allowed output words during inference. The chosen set is typically determined by separately trained alignment model parameters, independent of the source-sentence context at inference time. While vocabulary selection appears competitive with respect to automatic quality metrics in prior work, we show that it can fail to select the right set of output words, particularly for semantically non-compositional linguistic phenomena such as idiomatic expressions, leading to reduced translation quality as perceived by humans. Trading off latency for quality by increasing the size of the allowed set is often not an option in real-world scenarios. We propose a model of vocabulary selection, integrated into the neural translation model, that predicts the set of allowed output words from contextualized encoder representations. This restores translation quality of an unconstrained system, as measured by human evaluations on WMT newstest2020 and idiomatic expressions, at an inference latency competitive with alignment-based selection using aggressive thresholds, thereby removing the dependency on separately trained alignment models.
Submitted 13 May, 2022;
originally announced May 2022.
-
Neural Machine Translation Decoding with Terminology Constraints
Authors:
Eva Hasler,
Adrià De Gispert,
Gonzalo Iglesias,
Bill Byrne
Abstract:
Despite the impressive quality improvements yielded by neural machine translation (NMT) systems, controlling their translation output to adhere to user-provided terminology constraints remains an open problem. We describe our approach to constrained neural decoding based on finite-state machines and multi-stack decoding which supports target-side constraints as well as constraints with corresponding aligned input text spans. We demonstrate the performance of our framework on multiple translation tasks and motivate the need for constrained decoding with attentions as a means of reducing misplacement and duplication when translating user constraints.
Submitted 9 May, 2018;
originally announced May 2018.
-
Accelerating NMT Batched Beam Decoding with LMBR Posteriors for Deployment
Authors:
Gonzalo Iglesias,
William Tambellini,
Adrià De Gispert,
Eva Hasler,
Bill Byrne
Abstract:
We describe a batched beam decoding algorithm for NMT with LMBR n-gram posteriors, showing that LMBR techniques still yield gains on top of the best recently reported results with Transformers. We also discuss acceleration strategies for deployment, and the effect of the beam size and batching on memory and speed.
Submitted 30 April, 2018;
originally announced April 2018.
-
A Comparison of Neural Models for Word Ordering
Authors:
Eva Hasler,
Felix Stahlberg,
Marcus Tomalin,
Adrià de Gispert,
Bill Byrne
Abstract:
We compare several language models for the word-ordering task and propose a new bag-to-sequence neural model based on attention-based sequence-to-sequence models. We evaluate the model on a large German WMT data set where it significantly outperforms existing models. We also describe a novel search strategy for LM-based word ordering and report results on the English Penn Treebank. Our best model setup outperforms prior work both in terms of speed and quality.
Submitted 5 August, 2017;
originally announced August 2017.
-
SGNMT -- A Flexible NMT Decoding Platform for Quick Prototyping of New Models and Search Strategies
Authors:
Felix Stahlberg,
Eva Hasler,
Danielle Saunders,
Bill Byrne
Abstract:
This paper introduces SGNMT, our experimental platform for machine translation research. SGNMT provides a generic interface to neural and symbolic scoring modules (predictors) with left-to-right semantics, such as translation models like NMT, language models, translation lattices, $n$-best lists or other kinds of scores and constraints. Predictors can be combined with other predictors to form complex decoding tasks. SGNMT implements a number of search strategies for traversing the space spanned by the predictors which are appropriate for different predictor constellations. Adding new predictors or decoding strategies is particularly easy, making it a very efficient tool for prototyping new research ideas. SGNMT is actively being used by students in the MPhil program in Machine Learning, Speech and Language Technology at the University of Cambridge for course work and theses, as well as for most of the research work in our group.
Submitted 21 July, 2017;
originally announced July 2017.
-
Neural Machine Translation by Minimising the Bayes-risk with Respect to Syntactic Translation Lattices
Authors:
Felix Stahlberg,
Adrià de Gispert,
Eva Hasler,
Bill Byrne
Abstract:
We present a novel scheme to combine neural machine translation (NMT) with traditional statistical machine translation (SMT). Our approach borrows ideas from linearised lattice minimum Bayes-risk decoding for SMT. The NMT score is combined with the Bayes-risk of the translation according to the SMT lattice. This makes our approach much more flexible than $n$-best list or lattice rescoring as the neural decoder is not restricted to the SMT search space. We show an efficient and simple way to integrate risk estimation into the NMT decoder which is suitable for word-level as well as subword-unit-level NMT. We test our method on English-German and Japanese-English and report significant gains over lattice rescoring on several data sets for both single and ensembled NMT. The MBR decoder produces entirely new hypotheses far beyond simply rescoring the SMT search space or fixing UNKs in the NMT output.
Submitted 13 February, 2017; v1 submitted 12 December, 2016;
originally announced December 2016.
-
The Edit Distance Transducer in Action: The University of Cambridge English-German System at WMT16
Authors:
Felix Stahlberg,
Eva Hasler,
Bill Byrne
Abstract:
This paper presents the University of Cambridge submission to WMT16. Motivated by the complementary nature of syntactical machine translation and neural machine translation (NMT), we exploit the synergies of Hiero and NMT in different combination schemes. Starting out with a simple neural lattice rescoring approach, we show that the Hiero lattices are often too narrow for NMT ensembles. Therefore, instead of a hard restriction of the NMT search space to the lattice, we propose to loosely couple NMT and Hiero by composition with a modified version of the edit distance transducer. The loose combination outperforms lattice rescoring, especially when using multiple NMT systems in an ensemble.
Submitted 15 June, 2016;
originally announced June 2016.
-
Syntactically Guided Neural Machine Translation
Authors:
Felix Stahlberg,
Eva Hasler,
Aurelien Waite,
Bill Byrne
Abstract:
We investigate the use of hierarchical phrase-based SMT lattices in end-to-end neural machine translation (NMT). Weight pushing transforms the Hiero scores for complete translation hypotheses, with the full translation grammar score and full n-gram language model score, into posteriors compatible with NMT predictive probabilities. With a slightly modified NMT beam-search decoder we find gains over both Hiero and NMT decoding alone, with practical advantages in extending NMT to very large input and output vocabularies.
Submitted 19 May, 2016; v1 submitted 15 May, 2016;
originally announced May 2016.
-
Multilingual Image Description with Neural Sequence Models
Authors:
Desmond Elliott,
Stella Frank,
Eva Hasler
Abstract:
In this paper we present an approach to multi-language image description bringing together insights from neural machine translation and neural image description. To create a description of an image for a given target language, our sequence generation models condition on feature vectors from the image, the description from the source language, and/or a multimodal vector computed over the image and a description in the source language. In image description experiments on the IAPR-TC12 dataset of images aligned with English and German sentences, we find significant and substantial improvements in BLEU4 and Meteor scores for models trained over multiple languages, compared to a monolingual baseline.
Submitted 18 November, 2015; v1 submitted 15 October, 2015;
originally announced October 2015.