
Showing 1–32 of 32 results for author: Libovický, J

  1. arXiv:2406.13560  [pdf, other]

    cs.CL

    Lexically Grounded Subword Segmentation

    Authors: Jindřich Libovický, Jindřich Helcl

    Abstract: We present three innovations in tokenization and subword segmentation. First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization. Second, we present an algebraic method for obtaining subword embeddings grounded in a word embedding space. Based on that, we design a novel subword segmentation algorithm that uses the embeddings, ensuring that the procedure consid…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: 8 pages (+ 8 pages appendix), 2 figures

  2. arXiv:2404.06964  [pdf, other]

    cs.CL

    Charles Translator: A Machine Translation System between Ukrainian and Czech

    Authors: Martin Popel, Lucie Poláková, Michal Novák, Jindřich Helcl, Jindřich Libovický, Pavel Straňák, Tomáš Krabač, Jaroslava Hlaváčová, Mariia Anisimova, Tereza Chlaňová

    Abstract: We present Charles Translator, a machine translation system between Ukrainian and Czech, developed as part of a society-wide effort to mitigate the impact of the Russian-Ukrainian war on individuals and society. The system was developed in the spring of 2022 with the help of many language data providers in order to quickly meet the demand for such a service, which was not available at the time in…

    Submitted 10 April, 2024; originally announced April 2024.

  3. arXiv:2404.06228  [pdf, other]

    cs.CL

    Understanding Cross-Lingual Alignment -- A Survey

    Authors: Katharina Hämmerl, Jindřich Libovický, Alexander Fraser

    Abstract: Cross-lingual alignment, the meaningful similarity of representations across languages in multilingual language models, has been an active field of research in recent years. We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field. We present different understandings of cross-lingual alignment and…

    Submitted 11 June, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: Camera-ready version, ACL Findings 2024

  4. arXiv:2403.13514  [pdf, other]

    cs.CL cs.CY

    How Gender Interacts with Political Values: A Case Study on Czech BERT Models

    Authors: Adnan Al Ali, Jindřich Libovický

    Abstract: Neural language models, which reach state-of-the-art results on most natural language processing tasks, are trained on large text corpora that inevitably contain value-burdened content and often capture undesirable biases, which the models reflect. This case study focuses on the political biases of pre-trained encoders in Czech and compares them with a representative value survey. Because Czech is…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: 11 pages, 2 figures; LREC-COLING 2024

  5. arXiv:2403.02875  [pdf, other]

    cs.CV cs.CL cs.IR

    Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples

    Authors: Philipp J. Rösch, Norbert Oswald, Michaela Geierhos, Jindřich Libovický

    Abstract: Current multimodal models leveraging contrastive learning often face limitations in developing fine-grained conceptual understanding. This is due to random negative samples during pretraining, causing almost exclusively very dissimilar concepts to be compared in the loss function. Consequently, the models struggle with fine-grained semantic differences. To address this problem, we introduce a nove…

    Submitted 5 March, 2024; originally announced March 2024.

    Comments: 22 pages

  6. arXiv:2401.16092  [pdf, other]

    cs.CL cs.CY cs.LG

    Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You

    Authors: Felix Friedrich, Katharina Hämmerl, Patrick Schramowski, Manuel Brack, Jindrich Libovicky, Kristian Kersting, Alexander Fraser

    Abstract: Text-to-image generation models have recently achieved astonishing results in image quality, flexibility, and text alignment, and are consequently employed in a fast-growing number of applications. Through improvements in multilingual abilities, a larger community now has access to this technology. However, our results show that multilingual models suffer from significant gender biases just as mon…

    Submitted 15 May, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

  7. arXiv:2310.16528  [pdf, other]

    cs.CL

    CUNI Submission to MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval

    Authors: Jindřich Helcl, Jindřich Libovický

    Abstract: We present the Charles University system for the MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval. The goal of the shared task was to develop systems for named entity recognition and question answering in several under-represented languages. Our solutions to both subtasks rely on the translate-test approach. We first translate the unlabeled examples into English using a multi…

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: 8 pages, 2 figures; System description paper at the MRL 2023 workshop at EMNLP 2023

  8. arXiv:2307.10666  [pdf, other]

    cs.CL

    A Dataset and Strong Baselines for Classification of Czech News Texts

    Authors: Hynek Kydlíček, Jindřich Libovický

    Abstract: Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple classification tasks such as sentiment classification or article classification from a single news source. As an alternative, we present CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets, composed of ne…

    Submitted 20 July, 2023; originally announced July 2023.

    Comments: 12 pages, Accepted to Text, Speech and Dialogue (TSD) 2023

  9. arXiv:2306.00458  [pdf, other]

    cs.CL

    Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity

    Authors: Katharina Hämmerl, Alina Fastowski, Jindřich Libovický, Alexander Fraser

    Abstract: Previous work has shown that the representations output by contextual language models are more anisotropic than static type embeddings, and typically display outlier dimensions. This seems to be true for both monolingual and multilingual models, although much less work has been done on the multilingual context. Why these outliers occur and how they affect the representations is still an active are…

    Submitted 7 June, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: To appear in ACL Findings 2023. Fixed a citation in this version

  10. arXiv:2305.14482  [pdf, other]

    cs.CL

    Is a Prestigious Job the same as a Prestigious Country? A Case Study on Multilingual Sentence Embeddings and European Countries

    Authors: Jindřich Libovický

    Abstract: We study how multilingual sentence representations capture European countries and occupations and how this differs across European languages. We prompt the models with templated sentences that we machine-translate into 12 European languages and analyze the most prominent dimensions in the embeddings. Our analysis reveals that the most prominent feature in the embedding is the geopolitical distincti…

    Submitted 25 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 10 pages, 1 figure; Findings of EMNLP 2023, camera-ready

  11. Probing the Role of Positional Information in Vision-Language Models

    Authors: Philipp J. Rösch, Jindřich Libovický

    Abstract: In most Vision-Language models (VL), the understanding of the image structure is enabled by injecting the position information (PI) about objects in the image. In our case study of LXMERT, a state-of-the-art VL model, we probe the use of the PI in the representation and study its effect on Visual Question Answering. We show that the model is not capable of leveraging the PI for the image-text matc…

    Submitted 17 May, 2023; originally announced May 2023.

    Comments: Findings of the Association for Computational Linguistics: NAACL 2022, pages 1031-1041, Seattle, United States. Association for Computational Linguistics

    ACM Class: I.4; I.7

  12. arXiv:2212.00486  [pdf, other]

    cs.CL

    CUNI Systems for the WMT22 Czech-Ukrainian Translation Task

    Authors: Martin Popel, Jindřich Libovický, Jindřich Helcl

    Abstract: We present Charles University submissions to the WMT22 General Translation Shared Task on Czech-Ukrainian and Ukrainian-Czech machine translation. We present two constrained submissions based on block back-translation and tagged back-translation and experiment with rule-based romanization of Ukrainian. Our results show that the romanization only has a minor effect on the translation quality. Furth…

    Submitted 1 December, 2022; originally announced December 2022.

    Comments: 6 pages; System description paper at WMT22

  13. arXiv:2211.07733  [pdf, other]

    cs.CL

    Speaking Multiple Languages Affects the Moral Bias of Language Models

    Authors: Katharina Hämmerl, Björn Deiseroth, Patrick Schramowski, Jindřich Libovický, Constantin A. Rothkopf, Alexander Fraser, Kristian Kersting

    Abstract: Pre-trained multilingual language models (PMLMs) are commonly used when dealing with data from multiple languages and cross-lingual transfer. However, PMLMs are trained on varying amounts of data for each language. In practice this means their performance is often much better on English than many other languages. We explore to what extent this also applies to moral norms. Do the models capture mor…

    Submitted 1 June, 2023; v1 submitted 14 November, 2022; originally announced November 2022.

    Comments: To appear in ACL Findings 2023

  14. arXiv:2203.09904  [pdf, ps, other]

    cs.CL

    Do Multilingual Language Models Capture Differing Moral Norms?

    Authors: Katharina Hämmerl, Björn Deiseroth, Patrick Schramowski, Jindřich Libovický, Alexander Fraser, Kristian Kersting

    Abstract: Massively multilingual sentence representations are trained on large corpora of uncurated data, with a very imbalanced proportion of languages included in the training. This may cause the models to grasp cultural values including moral judgments from the high-resource languages and impose them on the low-resource languages. The lack of data in certain languages can also lead to developing random a…

    Submitted 18 March, 2022; originally announced March 2022.

  15. arXiv:2203.09326  [pdf, other]

    cs.CL

    Combining Static and Contextualised Multilingual Embeddings

    Authors: Katharina Hämmerl, Jindřich Libovický, Alexander Fraser

    Abstract: Static and contextual multilingual embeddings have complementary strengths. Static embeddings, while less expressive than contextual language models, can be more straightforwardly aligned across multiple languages. We combine the strengths of static and contextual models to improve multilingual representations. We extract static embeddings for 40 languages from XLM-R, validate those embeddings wit…

    Submitted 17 March, 2022; originally announced March 2022.

    Comments: Accepted to Findings of ACL 2022

  16. arXiv:2112.08288  [pdf, other]

    cs.CL

    Improving Both Domain Robustness and Domain Adaptability in Machine Translation

    Authors: Wen Lai, Jindřich Libovický, Alexander Fraser

    Abstract: We consider two problems of NMT domain adaptation using meta-learning. First, we want to reach domain robustness, i.e., we want to reach high quality on both domains seen in the training data and unseen domains. Second, we want our systems to be adaptive, i.e., making it possible to finetune systems with just hundreds of in-domain parallel sentences. We study the domain adaptability of meta-learni…

    Submitted 4 October, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

    Comments: Accepted to COLING 2022

  17. arXiv:2110.08191  [pdf, other]

    cs.CL

    Why don't people use character-level machine translation?

    Authors: Jindřich Libovický, Helmut Schmid, Alexander Fraser

    Abstract: We present a literature and empirical survey that critically assesses the state of the art in character-level modeling for machine translation (MT). Despite evidence in the literature that character-level systems are comparable with subword systems, they are virtually never used in competitive setups in WMT competitions. We empirically show that even with recent modeling innovations in character-l…

    Submitted 27 April, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: 16 pages, 4 figures; Findings of ACL 2022, camera-ready

  18. arXiv:2104.08388  [pdf, other]

    cs.CL cs.LG

    Neural String Edit Distance

    Authors: Jindřich Libovický, Alexander Fraser

    Abstract: We propose the neural string edit distance model for string-pair matching and string transduction based on learnable string edit distance. We modify the original expectation-maximization learned edit distance algorithm into a differentiable loss function, allowing us to integrate it into a neural network providing a contextual representation of the input. We evaluate on cognate detection, translit…

    Submitted 27 April, 2022; v1 submitted 16 April, 2021; originally announced April 2021.

    Comments: 14 pages, 5 figures; Workshop on Structured Prediction for NLP @ACL 2022, camera-ready

  19. arXiv:2004.14280  [pdf, other]

    cs.CL

    Towards Reasonably-Sized Character-Level Transformer NMT by Finetuning Subword Systems

    Authors: Jindřich Libovický, Alexander Fraser

    Abstract: Applying the Transformer architecture on the character level usually requires very deep architectures that are difficult and slow to train. These problems can be partially overcome by incorporating a segmentation into tokens in the model. We show that by initially training a subword model and then finetuning it on characters, we can obtain a neural machine translation model that works at the chara…

    Submitted 29 September, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: 8 pages, 1 figure; Accepted to EMNLP 2020

  20. arXiv:2004.05160  [pdf, other]

    cs.CL

    On the Language Neutrality of Pre-trained Multilingual Representations

    Authors: Jindřich Libovický, Rudolf Rosa, Alexander Fraser

    Abstract: Multilingual contextual embeddings, such as multilingual BERT and XLM-RoBERTa, have proved useful for many multi-lingual tasks. Previous work probed the cross-linguality of the representations indirectly using zero-shot transfer learning on morphological and syntactic tasks. We instead investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical se…

    Submitted 29 September, 2020; v1 submitted 9 April, 2020; originally announced April 2020.

    Comments: 12 pages, 3 figures. arXiv admin note: text overlap with arXiv:1911.03310. Accepted to Findings of EMNLP 2020

  21. arXiv:2004.03227  [pdf, ps, other]

    cs.CL

    Improving Fluency of Non-Autoregressive Machine Translation

    Authors: Zdeněk Kasner, Jindřich Libovický, Jindřich Helcl

    Abstract: Non-autoregressive (nAR) models for machine translation (MT) manifest superior decoding speed when compared to autoregressive (AR) models, at the expense of impaired fluency of their outputs. We improve the fluency of a nAR model with connectionist temporal classification (CTC) by employing additional features in the scoring model used during beam search decoding. Since the beam search decoding in…

    Submitted 7 April, 2020; originally announced April 2020.

  22. arXiv:1911.03310  [pdf, other]

    cs.CL

    How Language-Neutral is Multilingual BERT?

    Authors: Jindřich Libovický, Rudolf Rosa, Alexander Fraser

    Abstract: Multilingual BERT (mBERT) provides sentence representations for 104 languages, which are useful for many multi-lingual tasks. Previous work probed the cross-linguality of mBERT using zero-shot transfer learning on morphological and syntactic tasks. We instead focus on the semantic properties of mBERT. We show that mBERT representations can be split into a language-specific component and a language…

    Submitted 8 November, 2019; originally announced November 2019.

    Comments: 6 pages, 3 figures

  23. arXiv:1908.11125  [pdf, other]

    cs.CL

    Probing Representations Learned by Multimodal Recurrent and Transformer Models

    Authors: Jindřich Libovický, Pranava Madhyastha

    Abstract: Recent literature shows that large-scale language modeling provides excellent reusable sentence representations with both recurrent and self-attentive architectures. However, there has been less clarity on the commonalities and differences in the representational properties induced by the two architectures. It also has been shown that visual information serves as one of the means for grounding sen…

    Submitted 29 August, 2019; originally announced August 2019.

    Comments: 8 pages, 2 figures

    MSC Class: 68T50 ACM Class: I.2.7

  24. arXiv:1907.04613  [pdf, other]

    cs.CL cs.LG

    Neural Networks as Explicit Word-Based Rules

    Authors: Jindřich Libovický

    Abstract: Filters of convolutional networks used in computer vision are often visualized as image patches that maximize the response of the filter. We use the same approach to interpret weight matrices in simple architectures for natural language processing tasks. We interpret a convolutional network for sentiment classification as word-based rules. Using the rule, we recover the performance of the original…

    Submitted 10 July, 2019; originally announced July 2019.

    Comments: 3 pages; extended abstract at BlackboxNLP 2019

    MSC Class: 68T50 ACM Class: I.2.7

  25. arXiv:1906.09246  [pdf, ps, other]

    cs.CL

    CUNI System for the WMT19 Robustness Task

    Authors: Jindřich Helcl, Jindřich Libovický, Martin Popel

    Abstract: We present our submission to the WMT19 Robustness Task. Our baseline system is the Charles University (CUNI) Transformer system trained for the WMT18 shared task on News Translation. Quantitative results show that the CUNI Transformer system is already far more robust to noisy input than the LSTM-based baseline provided by the task organizers. We further improved the performance of our model by fi…

    Submitted 21 June, 2019; originally announced June 2019.

    Comments: WMT19

  26. arXiv:1906.07901  [pdf, other]

    cs.CL cs.CV cs.LG cs.MM

    Multimodal Abstractive Summarization for How2 Videos

    Authors: Shruti Palaskar, Jindrich Libovický, Spandana Gella, Florian Metze

    Abstract: In this paper, we study abstractive summarization for open-domain videos. Unlike the traditional text news summarization, the goal is less to "compress" text information but rather to provide a fluent textual summary of information that has been collected and fused from different source modalities, in our case video and audio transcripts (or text). We show how a multi-source sequence-to-sequence m…

    Submitted 18 June, 2019; originally announced June 2019.

    Comments: To appear in ACL 2019

  27. arXiv:1811.04719  [pdf, ps, other]

    cs.CL

    End-to-End Non-Autoregressive Neural Machine Translation with Connectionist Temporal Classification

    Authors: Jindřich Libovický, Jindřich Helcl

    Abstract: Autoregressive decoding is the only part of sequence-to-sequence models that prevents them from massive parallelization at inference time. Non-autoregressive models enable the decoder to generate all output symbols independently in parallel. We present a novel non-autoregressive architecture based on connectionist temporal classification and evaluate it on the task of neural machine translation. U…

    Submitted 12 November, 2018; originally announced November 2018.

    Comments: EMNLP 2018

  28. arXiv:1811.04716  [pdf, other]

    cs.CL

    Input Combination Strategies for Multi-Source Transformer Decoder

    Authors: Jindřich Libovický, Jindřich Helcl, David Mareček

    Abstract: In multi-source sequence-to-sequence tasks, the attention mechanism can be modeled in several ways. This topic has been thoroughly studied on recurrent architectures. In this paper, we extend the previous work to the encoder-decoder attention in the Transformer architecture. We propose four different input combination strategies for the encoder-decoder attention: serial, parallel, flat, and hierar…

    Submitted 12 November, 2018; originally announced November 2018.

    Comments: Published at WMT18

  29. arXiv:1811.04697  [pdf, ps, other]

    cs.CL

    CUNI System for the WMT18 Multimodal Translation Task

    Authors: Jindřich Helcl, Jindřich Libovický, Dušan Variš

    Abstract: We present our submission to the WMT18 Multimodal Translation Task. The main feature of our submission is applying a self-attentive network instead of a recurrent neural network. We evaluate two methods of incorporating the visual features in the model: first, we include the image representation as another input to the network; second, we train the model to predict the visual features and use it a…

    Submitted 12 November, 2018; originally announced November 2018.

    Comments: Published at WMT18

  30. arXiv:1707.04550  [pdf, other]

    cs.CL cs.NE

    CUNI System for the WMT17 Multimodal Translation Task

    Authors: Jindřich Helcl, Jindřich Libovický

    Abstract: In this paper, we describe our submissions to the WMT17 Multimodal Translation Task. For Task 1 (multimodal translation), our best scoring system is a purely textual neural translation of the source image caption to the target language. The main feature of the system is the use of additional data that was acquired by selecting similar sentences from parallel corpora and by data synthesis with back…

    Submitted 14 July, 2017; originally announced July 2017.

    Comments: 8 pages; Camera-ready submission to WMT17

    ACM Class: I.2.7

  31. arXiv:1704.06567  [pdf, other]

    cs.CL cs.NE

    Attention Strategies for Multi-Source Sequence-to-Sequence Learning

    Authors: Jindřich Libovický, Jindřich Helcl

    Abstract: Modeling attention in neural multi-source sequence-to-sequence learning remains a relatively unexplored area, despite its usefulness in tasks that incorporate multiple source languages or modalities. We propose two novel approaches to combine the outputs of attention mechanisms over each source sequence, flat and hierarchical. We compare the proposed methods with existing techniques and present re…

    Submitted 21 April, 2017; originally announced April 2017.

    Comments: 7 pages; Accepted to ACL 2017

    MSC Class: 68T50 ACM Class: I.2.7

  32. arXiv:1606.07481  [pdf, other]

    cs.CL

    CUNI System for WMT16 Automatic Post-Editing and Multimodal Translation Tasks

    Authors: Jindřich Libovický, Jindřich Helcl, Marek Tlustý, Pavel Pecina, Ondřej Bojar

    Abstract: Neural sequence to sequence learning recently became a very promising paradigm in machine translation, achieving competitive results with statistical phrase-based systems. In this system description paper, we attempt to utilize several recently published methods used for neural sequential learning in order to build systems for WMT 2016 shared tasks of Automatic Post-Editing and Multimodal Machine…

    Submitted 23 June, 2016; originally announced June 2016.

    Comments: Accepted to the First Conference of Machine Translation (WMT16)