Showing 1–15 of 15 results for author: Bogoychev, N

  1. arXiv:2311.14838  [pdf, other]

    cs.CL

    OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models

    Authors: Nikolay Bogoychev, Jelmer van der Linde, Graeme Nail, Barry Haddow, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Lukas Weymann, Tudor Nicolae Mateiu, Jindřich Helcl, Mikko Aulamo

    Abstract: Developing high quality machine translation systems is a labour intensive, challenging and confusing process for newcomers to the field. We present a pair of tools, OpusCleaner and OpusTrainer, that aim to simplify the process, reduce the amount of work and lower the entry barrier for newcomers. OpusCleaner is a data downloading, cleaning, and preprocessing toolkit. It is designed to allow researc…

    Submitted 24 November, 2023; originally announced November 2023.

    Comments: Code on Github: https://github.com/hplt-project/OpusCleaner and https://github.com/hplt-project/OpusTrainer

  2. arXiv:2311.09709  [pdf, other]

    cs.CL

    The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics

    Authors: Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch

    Abstract: Deploying large language models (LLMs) encounters challenges due to intensive computational and memory requirements. Our research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency. While such modifications have been proven effective in tasks like machine translation, tailoring them to LLMs demands specific…

    Submitted 28 April, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

    Comments: Version 2, accepted at Insights from Negative Results 2024

  3. arXiv:2310.05824  [pdf, other]

    cs.CL

    Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting

    Authors: Nikolay Bogoychev, Pinzhen Chen

    Abstract: Terminology correctness is important in the downstream application of machine translation, and a prevalent way to ensure this is to inject terminology constraints into a translation system. In our submission to the WMT 2023 terminology translation task, we adopt a translate-then-refine approach which can be domain-independent and requires minimal manual effort. We annotate random source words wit…

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: WMT 2023 Terminology Translation Task

  4. arXiv:2309.08958  [pdf, other]

    cs.CL cs.AI

    Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca

    Authors: Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, Kenneth Heafield

    Abstract: Foundational large language models (LLMs) can be instruction-tuned to perform open-domain question answering, facilitating applications like chat assistants. While such efforts are often carried out in a single language, we empirically analyze cost-efficient strategies for multilingual scenarios. Our study employs the Alpaca dataset and machine translations of it to form multilingual data, which i…

    Submitted 30 January, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

    Comments: Accepted to Findings of ACL: EACL 2024. Added human evaluation and shortened writing

  5. An Open Dataset and Model for Language Identification

    Authors: Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, Kenneth Heafield

    Abstract: Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: To be published in ACL 2023

  6. arXiv:2303.18110  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR

    Authors: Ramon Sanabria, Nikolay Bogoychev, Nina Markl, Andrea Carmantini, Ondrej Klejch, Peter Bell

    Abstract: English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a result, there are many varieties of English. Despite the many advances in English automatic speech recognition (ASR) over the past decades, results are usually reported based on test datasets which fail to represent the diversity of English…

    Submitted 31 March, 2023; originally announced March 2023.

    Comments: Accepted to IEEE ICASSP 2023

  7. arXiv:2203.06462  [pdf, other]

    cs.LG cs.CL

    Low-Rank Softmax Can Have Unargmaxable Classes in Theory but Rarely in Practice

    Authors: Andreas Grivas, Nikolay Bogoychev, Adam Lopez

    Abstract: Classifiers in natural language processing (NLP) often have a large number of output classes. For example, neural language models (LMs) and machine translation (MT) models both predict tokens from a vocabulary of thousands. The Softmax output layer of these models typically receives as input a dense feature representation, which has much lower dimensionality than the output. In theory, the result…

    Submitted 21 March, 2022; v1 submitted 12 March, 2022; originally announced March 2022.

    Comments: Preprint of conference paper accepted at ACL 2022

  8. arXiv:2109.10194  [pdf, other]

    cs.CL

    TranslateLocally: Blazing-fast translation running on the local CPU

    Authors: Nikolay Bogoychev, Jelmer Van der Linde, Kenneth Heafield

    Abstract: Every day, millions of people sacrifice their privacy and browsing habits in exchange for online machine translation. Companies and governments with confidentiality requirements often ban online translation or pay a premium to disable logging. To bring control back to the end user and demonstrate speed, we developed translateLocally. Running locally on a desktop or laptop CPU, translateLocally del…

    Submitted 21 September, 2021; originally announced September 2021.

    Comments: Accepted at EMNLP 2021 demo track; https://translatelocally.com

  9. arXiv:2101.00421  [pdf, other]

    cs.CL

    The Highs and Lows of Simple Lexical Domain Adaptation Approaches for Neural Machine Translation

    Authors: Nikolay Bogoychev, Pinzhen Chen

    Abstract: Machine translation systems are vulnerable to domain mismatch, especially in a low-resource scenario. Out-of-domain translations are often of poor quality and prone to hallucinations, due to exposure bias and the decoder acting as a language model. We adopt two approaches to alleviate this problem: lexical shortlisting restricted by IBM statistical alignments, and hypothesis re-ranking based on si…

    Submitted 21 September, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

    Comments: Accepted at Workshop on Insights from Negative Results in NLP 2021

  10. arXiv:2010.11859  [pdf, other]

    cs.CL

    Not all parameters are born equal: Attention is mostly what you need

    Authors: Nikolay Bogoychev

    Abstract: Transformers are widely used in state-of-the-art machine translation, but the key to their success is still unknown. To gain insight into this, we consider three groups of parameters: embeddings, attention, and feed forward neural network (FFN) layers. We examine the relative importance of each by performing an ablation study where we initialise them at random and freeze them, so that their weight…

    Submitted 21 September, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted at BlackboxNLP 2021

  11. arXiv:1911.03362  [pdf, other]

    cs.CL cs.LG stat.ML

    Domain, Translationese and Noise in Synthetic Data for Neural Machine Translation

    Authors: Nikolay Bogoychev, Rico Sennrich

    Abstract: The quality of neural machine translation can be improved by leveraging additional monolingual resources to create synthetic training data. Source-side monolingual data can be (forward-)translated into the target language for self-training; target-side monolingual data can be back-translated. It has been widely reported that back-translation delivers superior results, but could this be due to arte…

    Submitted 3 October, 2020; v1 submitted 6 November, 2019; originally announced November 2019.

  12. arXiv:1907.05854  [pdf, other]

    cs.CL

    The University of Edinburgh's Submissions to the WMT19 News Translation Task

    Authors: Rachel Bawden, Nikolay Bogoychev, Ulrich Germann, Roman Grundkiewicz, Faheem Kirefu, Antonio Valerio Miceli Barone, Alexandra Birch

    Abstract: The University of Edinburgh participated in the WMT19 Shared Task on News Translation in six language directions: English-to-Gujarati, Gujarati-to-English, English-to-Chinese, Chinese-to-English, German-to-English, and English-to-Czech. For all translation directions, we created or used back-translations of monolingual data in the target language as additional synthetic training data. For English-…

    Submitted 12 July, 2019; originally announced July 2019.

    Comments: To appear in the Proceedings of WMT19: Shared Task Papers

  13. arXiv:1808.08859  [pdf, ps, other]

    cs.CL

    Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation

    Authors: Nikolay Bogoychev, Marcin Junczys-Dowmunt, Kenneth Heafield, Alham Fikri Aji

    Abstract: In order to extract the best possible performance from asynchronous stochastic gradient descent, one must increase the mini-batch size and scale the learning rate accordingly. In order to achieve further speedup, we introduce a technique that delays gradient updates, effectively increasing the mini-batch size. Unfortunately, with the increase of mini-batch size we worsen the stale gradient problem in…

    Submitted 14 September, 2018; v1 submitted 27 August, 2018; originally announced August 2018.

    Comments: To appear in EMNLP 2018 as a short paper

  14. arXiv:1804.00344  [pdf, other]

    cs.CL

    Marian: Fast Neural Machine Translation in C++

    Authors: Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, Alexandra Birch

    Abstract: We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

    Submitted 4 April, 2018; v1 submitted 1 April, 2018; originally announced April 2018.

    Comments: Demonstration paper

  15. arXiv:1610.04265  [pdf, other]

    cs.CL

    Fast, Scalable Phrase-Based SMT Decoding

    Authors: Hieu Hoang, Nikolay Bogoychev, Lane Schwartz, Marcin Junczys-Dowmunt

    Abstract: The utilization of statistical machine translation (SMT) has grown enormously over the last decade, with many users relying on open-source software developed by the NLP community. As commercial use has increased, there is a need for software that is optimized for commercial requirements, in particular, fast phrase-based decoding and more efficient utilization of modern multicore servers. In this paper we re-exami…

    Submitted 18 October, 2016; v1 submitted 13 October, 2016; originally announced October 2016.