Showing 1–17 of 17 results for author: Helcl, J

Searching in archive cs.
  1. arXiv:2406.13560  [pdf, other]

    cs.CL

    Lexically Grounded Subword Segmentation

    Authors: Jindřich Libovický, Jindřich Helcl

    Abstract: We present three innovations in tokenization and subword segmentation. First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization. Second, we present an algebraic method for obtaining subword embeddings grounded in a word embedding space. Based on that, we design a novel subword segmentation algorithm that uses the embeddings, ensuring that the procedure consid…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: 8 pages (+ 8 pages appendix), 2 figures

  2. arXiv:2404.06964  [pdf, other]

    cs.CL

    Charles Translator: A Machine Translation System between Ukrainian and Czech

    Authors: Martin Popel, Lucie Poláková, Michal Novák, Jindřich Helcl, Jindřich Libovický, Pavel Straňák, Tomáš Krabač, Jaroslava Hlaváčová, Mariia Anisimova, Tereza Chlaňová

    Abstract: We present Charles Translator, a machine translation system between Ukrainian and Czech, developed as part of a society-wide effort to mitigate the impact of the Russian-Ukrainian war on individuals and society. The system was developed in the spring of 2022 with the help of many language data providers in order to quickly meet the demand for such a service, which was not available at the time in…

    Submitted 10 April, 2024; originally announced April 2024.

  3. arXiv:2311.14838  [pdf, other]

    cs.CL

    OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models

    Authors: Nikolay Bogoychev, Jelmer van der Linde, Graeme Nail, Barry Haddow, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Lukas Weymann, Tudor Nicolae Mateiu, Jindřich Helcl, Mikko Aulamo

    Abstract: Developing high quality machine translation systems is a labour intensive, challenging and confusing process for newcomers to the field. We present a pair of tools OpusCleaner and OpusTrainer that aim to simplify the process, reduce the amount of work and lower the entry barrier for newcomers. OpusCleaner is a data downloading, cleaning, and preprocessing toolkit. It is designed to allow researc…

    Submitted 24 November, 2023; originally announced November 2023.

    Comments: Code on Github: https://github.com/hplt-project/OpusCleaner and https://github.com/hplt-project/OpusTrainer

  4. arXiv:2310.16528  [pdf, other]

    cs.CL

    CUNI Submission to MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval

    Authors: Jindřich Helcl, Jindřich Libovický

    Abstract: We present the Charles University system for the MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval. The goal of the shared task was to develop systems for named entity recognition and question answering in several under-represented languages. Our solutions to both subtasks rely on the translate-test approach. We first translate the unlabeled examples into English using a multi…

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: 8 pages, 2 figures; System description paper at the MRL 2023 workshop at EMNLP 2023

  5. arXiv:2212.00486  [pdf, other]

    cs.CL

    CUNI Systems for the WMT22 Czech-Ukrainian Translation Task

    Authors: Martin Popel, Jindřich Libovický, Jindřich Helcl

    Abstract: We present Charles University submissions to the WMT22 General Translation Shared Task on Czech-Ukrainian and Ukrainian-Czech machine translation. We present two constrained submissions based on block back-translation and tagged back-translation and experiment with rule-based romanization of Ukrainian. Our results show that the romanization only has a minor effect on the translation quality. Furth…

    Submitted 1 December, 2022; originally announced December 2022.

    Comments: 6 pages; System description paper at WMT22

  6. arXiv:2212.00477  [pdf, other]

    cs.CL

    CUNI Non-Autoregressive System for the WMT 22 Efficient Translation Shared Task

    Authors: Jindřich Helcl

    Abstract: We present a non-autoregressive system submission to the WMT 22 Efficient Translation Shared Task. Our system was used by Helcl et al. (2022) in an attempt to provide fair comparison between non-autoregressive and autoregressive models. This submission is an effort to establish solid baselines along with sound evaluation methodology, particularly in terms of measuring the decoding speed. The model…

    Submitted 1 December, 2022; originally announced December 2022.

  7. arXiv:2205.01966  [pdf, other]

    cs.CL

    Non-Autoregressive Machine Translation: It's Not as Fast as it Seems

    Authors: Jindřich Helcl, Barry Haddow, Alexandra Birch

    Abstract: Efficient machine translation models are commercially important as they can increase inference speeds, and reduce costs and carbon emissions. Recently, there has been much interest in non-autoregressive (NAR) models, which promise faster translation. In parallel to the research on NAR models, there have been successful attempts to create optimized autoregressive models as part of the WMT shared ta…

    Submitted 4 May, 2022; originally announced May 2022.

    Comments: NAACL 2022, Camera-ready

  8. arXiv:2109.00486  [pdf, other]

    cs.CL

    Survey of Low-Resource Machine Translation

    Authors: Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl, Alexandra Birch

    Abstract: We present a survey covering the state of the art in low-resource machine translation research. There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated train…

    Submitted 7 February, 2022; v1 submitted 1 September, 2021; originally announced September 2021.

  9. arXiv:2004.03227  [pdf, ps, other]

    cs.CL

    Improving Fluency of Non-Autoregressive Machine Translation

    Authors: Zdeněk Kasner, Jindřich Libovický, Jindřich Helcl

    Abstract: Non-autoregressive (nAR) models for machine translation (MT) manifest superior decoding speed when compared to autoregressive (AR) models, at the expense of impaired fluency of their outputs. We improve the fluency of a nAR model with connectionist temporal classification (CTC) by employing additional features in the scoring model used during beam search decoding. Since the beam search decoding in…

    Submitted 7 April, 2020; originally announced April 2020.

  10. arXiv:1906.09246  [pdf, ps, other]

    cs.CL

    CUNI System for the WMT19 Robustness Task

    Authors: Jindřich Helcl, Jindřich Libovický, Martin Popel

    Abstract: We present our submission to the WMT19 Robustness Task. Our baseline system is the Charles University (CUNI) Transformer system trained for the WMT18 shared task on News Translation. Quantitative results show that the CUNI Transformer system is already far more robust to noisy input than the LSTM-based baseline provided by the task organizers. We further improved the performance of our model by fi…

    Submitted 21 June, 2019; originally announced June 2019.

    Comments: WMT19

  11. arXiv:1811.04719  [pdf, ps, other]

    cs.CL

    End-to-End Non-Autoregressive Neural Machine Translation with Connectionist Temporal Classification

    Authors: Jindřich Libovický, Jindřich Helcl

    Abstract: Autoregressive decoding is the only part of sequence-to-sequence models that prevents them from massive parallelization at inference time. Non-autoregressive models enable the decoder to generate all output symbols independently in parallel. We present a novel non-autoregressive architecture based on connectionist temporal classification and evaluate it on the task of neural machine translation. U…

    Submitted 12 November, 2018; originally announced November 2018.

    Comments: EMNLP 2018

  12. arXiv:1811.04716  [pdf, other]

    cs.CL

    Input Combination Strategies for Multi-Source Transformer Decoder

    Authors: Jindřich Libovický, Jindřich Helcl, David Mareček

    Abstract: In multi-source sequence-to-sequence tasks, the attention mechanism can be modeled in several ways. This topic has been thoroughly studied on recurrent architectures. In this paper, we extend the previous work to the encoder-decoder attention in the Transformer architecture. We propose four different input combination strategies for the encoder-decoder attention: serial, parallel, flat, and hierar…

    Submitted 12 November, 2018; originally announced November 2018.

    Comments: Published at WMT18

  13. arXiv:1811.04697  [pdf, ps, other]

    cs.CL

    CUNI System for the WMT18 Multimodal Translation Task

    Authors: Jindřich Helcl, Jindřich Libovický, Dušan Variš

    Abstract: We present our submission to the WMT18 Multimodal Translation Task. The main feature of our submission is applying a self-attentive network instead of a recurrent neural network. We evaluate two methods of incorporating the visual features in the model: first, we include the image representation as another input to the network; second, we train the model to predict the visual features and use it a…

    Submitted 12 November, 2018; originally announced November 2018.

    Comments: Published at WMT18

  14. arXiv:1707.07631  [pdf, other]

    cs.CL

    Deep Architectures for Neural Machine Translation

    Authors: Antonio Valerio Miceli Barone, Jindřich Helcl, Rico Sennrich, Barry Haddow, Alexandra Birch

    Abstract: It has been shown that increasing model depth improves the quality of neural machine translation. However, different architectural variants to increase model depth have been proposed, and so far, there has been no thorough comparative study. In this work, we describe and evaluate several existing approaches to introduce depth in neural machine translation. Additionally, we explore novel architec…

    Submitted 24 July, 2017; originally announced July 2017.

    Comments: WMT 2017 research track

  15. arXiv:1707.04550  [pdf, other]

    cs.CL cs.NE

    CUNI System for the WMT17 Multimodal Translation Task

    Authors: Jindřich Helcl, Jindřich Libovický

    Abstract: In this paper, we describe our submissions to the WMT17 Multimodal Translation Task. For Task 1 (multimodal translation), our best scoring system is a purely textual neural translation of the source image caption to the target language. The main feature of the system is the use of additional data that was acquired by selecting similar sentences from parallel corpora and by data synthesis with back…

    Submitted 14 July, 2017; originally announced July 2017.

    Comments: 8 pages; Camera-ready submission to WMT17

    ACM Class: I.2.7

  16. arXiv:1704.06567  [pdf, other]

    cs.CL cs.NE

    Attention Strategies for Multi-Source Sequence-to-Sequence Learning

    Authors: Jindřich Libovický, Jindřich Helcl

    Abstract: Modeling attention in neural multi-source sequence-to-sequence learning remains a relatively unexplored area, despite its usefulness in tasks that incorporate multiple source languages or modalities. We propose two novel approaches to combine the outputs of attention mechanisms over each source sequence, flat and hierarchical. We compare the proposed methods with existing techniques and present re…

    Submitted 21 April, 2017; originally announced April 2017.

    Comments: 7 pages; Accepted to ACL 2017

    MSC Class: 68T50 ACM Class: I.2.7

  17. arXiv:1606.07481  [pdf, other]

    cs.CL

    CUNI System for WMT16 Automatic Post-Editing and Multimodal Translation Tasks

    Authors: Jindřich Libovický, Jindřich Helcl, Marek Tlustý, Pavel Pecina, Ondřej Bojar

    Abstract: Neural sequence to sequence learning recently became a very promising paradigm in machine translation, achieving competitive results with statistical phrase-based systems. In this system description paper, we attempt to utilize several recently published methods used for neural sequential learning in order to build systems for WMT 2016 shared tasks of Automatic Post-Editing and Multimodal Machine…

    Submitted 23 June, 2016; originally announced June 2016.

    Comments: Accepted to the First Conference of Machine Translation (WMT16)