Showing 1–17 of 17 results for author: Mareček, D

  1. arXiv:2310.18913

    cs.CL cs.AI stat.ML

    Debiasing Algorithm through Model Adaptation

    Authors: Tomasz Limisiewicz, David Mareček, Tomáš Musil

    Abstract: Large language models are becoming the go-to solution for the ever-growing number of tasks. However, with growing capacity, models are prone to rely on spurious correlations stemming from biases and stereotypes present in the training data. This work proposes a novel method for detecting and mitigating gender bias in language models. We perform causal analysis to identify problematic model compone…

    Submitted 29 May, 2024; v1 submitted 29 October, 2023; originally announced October 2023.

    Comments: Accepted to ICLR 2024

  2. arXiv:2309.12491

    cs.CL cs.AI

    Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation

    Authors: Bar Iluz, Tomasz Limisiewicz, Gabriel Stanovsky, David Mareček

    Abstract: We study the effect of tokenization on gender bias in machine translation, an aspect that has been largely overlooked in previous works. Specifically, we focus on the interactions between the frequency of gendered profession names in training data, their representation in the subword tokenizer's vocabulary, and gender bias. We observe that female and non-stereotypical gender inflections of profess…

    Submitted 30 September, 2023; v1 submitted 21 September, 2023; originally announced September 2023.

    Comments: Accepted to AACL 2023

  3. arXiv:2306.11899

    physics.data-an cond-mat.mtrl-sci cs.LG physics.app-ph

    Closing the loop: Autonomous experiments enabled by machine-learning-based online data analysis in synchrotron beamline environments

    Authors: Linus Pithan, Vladimir Starostin, David Mareček, Lukas Petersdorf, Constantin Völter, Valentin Munteanu, Maciej Jankowski, Oleg Konovalov, Alexander Gerlach, Alexander Hinderhofer, Bridget Murphy, Stefan Kowarik, Frank Schreiber

    Abstract: Recently, there has been significant interest in applying machine learning (ML) techniques to X-ray scattering experiments, which proves to be a valuable tool for enhancing research that involves large or rapidly generated datasets. ML allows for the automated interpretation of experimental results, particularly those obtained from synchrotron or neutron facilities. The speed at which ML models ca…

    Submitted 20 June, 2023; originally announced June 2023.

  4. arXiv:2305.17179

    cs.CL

    Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages

    Authors: Tomasz Limisiewicz, Jiří Balhar, David Mareček

    Abstract: Multilingual language models have recently gained attention as a promising solution for representing multiple languages in a single model. In this paper, we propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers. Our findings show that the overlap of vocabulary across languages can actually be detrimental to certain downstream…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: in ACL Findings 2023

  5. arXiv:2212.09580

    cs.CL

    Independent Components of Word Embeddings Represent Semantic Features

    Authors: Tomáš Musil, David Mareček

    Abstract: Independent Component Analysis (ICA) is an algorithm originally developed for finding separate sources in a mixed signal, such as a recording of multiple people in the same room speaking at the same time. It has also been used to find linguistic features in distributional representations. In this paper, we used ICA to analyze word embeddings. We have found that ICA can be used to find semantic fe…

    Submitted 19 December, 2022; originally announced December 2022.

  6. arXiv:2206.10744

    cs.CL cs.AI

    Don't Forget About Pronouns: Removing Gender Bias in Language Models Without Losing Factual Gender Information

    Authors: Tomasz Limisiewicz, David Mareček

    Abstract: The representations in large language models contain multiple types of gender information. We focus on two types of such signals in English texts: factual gender information, which is a grammatical or semantic property, and gender bias, which is the correlation between a word and specific gender. We can disentangle the model's embeddings and identify components encoding both types of information w…

    Submitted 21 June, 2022; originally announced June 2022.

    Comments: Presented at GeBNLP 2022

  7. arXiv:2109.04921

    cs.CL cs.AI

    Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes

    Authors: Tomasz Limisiewicz, David Mareček

    Abstract: State-of-the-art contextual embeddings are obtained from large language models available only for a few languages. For others, we need to learn representations using a multilingual model. There is an ongoing debate on whether multilingual embeddings can be aligned in a space shared across many languages. The novel Orthogonal Structural Probe (Limisiewicz and Mareček, 2021) allows us to answer this…

    Submitted 10 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021 Main Conference

  8. arXiv:2102.08892

    cs.CL cs.HC

    THEaiTRE 1.0: Interactive generation of theatre play scripts

    Authors: Rudolf Rosa, Tomáš Musil, Ondřej Dušek, Dominik Jurko, Patrícia Schmidtová, David Mareček, Ondřej Bojar, Tom Kocmi, Daniel Hrbek, David Košťák, Martina Kinská, Marie Nováková, Josef Doležal, Klára Vosecká, Tomáš Studeník, Petr Žabka

    Abstract: We present the first version of a system for interactive generation of theatre play scripts. The system is based on a vanilla GPT-2 model with several adjustments, targeting specific issues we encountered in practice. We also list other issues we encountered but plan to only solve in a future version of the system. The presented system was used to generate a theatre play script planned for premier…

    Submitted 17 February, 2021; originally announced February 2021.

    Comments: Submitted to Text2Story workshop 2021

    Journal ref: Proc. Text2Story (2021) 71-76

  9. arXiv:2012.15228

    cs.CL

    Introducing Orthogonal Constraint in Structural Probes

    Authors: Tomasz Limisiewicz, David Mareček

    Abstract: With the recent success of pre-trained models in NLP, a significant focus was put on interpreting their representations. One of the most prominent approaches is structural probing (Hewitt and Manning, 2019), where a linear projection of word embeddings is performed in order to approximate the topology of dependency structures. In this work, we introduce a new type of structural probing, where the…

    Submitted 23 June, 2021; v1 submitted 30 December, 2020; originally announced December 2020.

  10. arXiv:2010.01063

    cs.CL

    Syntax Representation in Word Embeddings and Neural Networks -- A Survey

    Authors: Tomasz Limisiewicz, David Mareček

    Abstract: Neural networks trained on natural language processing tasks capture syntax even though it is not provided as a supervision signal. This indicates that syntactic analysis is essential to the understanding of language in artificial intelligence systems. This overview paper covers approaches to evaluating the amount of syntactic information included in the representations of words for different neura…

    Submitted 2 October, 2020; originally announced October 2020.

    Journal ref: Proceedings of the 20th Conference ITAT 2020: Automata, Formal and Natural Languages Workshop

  11. Measuring Memorization Effect in Word-Level Neural Networks Probing

    Authors: Rudolf Rosa, Tomáš Musil, David Mareček

    Abstract: Multiple studies have probed representations emerging in neural networks trained for end-to-end NLP tasks and examined what word-level linguistic information may be encoded in the representations. In classical probing, a classifier is trained on the representations to extract the target linguistic information. However, there is a threat of the classifier simply memorizing the linguistic labels for…

    Submitted 29 June, 2020; originally announced June 2020.

    Comments: Accepted to TSD 2020. Will be published in Springer LNCS

    Journal ref: LNCS 12284, TSD (2020) 180-188

  12. arXiv:2006.14668

    cs.CL

    THEaiTRE: Artificial Intelligence to Write a Theatre Play

    Authors: Rudolf Rosa, Ondřej Dušek, Tom Kocmi, David Mareček, Tomáš Musil, Patrícia Schmidtová, Dominik Jurko, Ondřej Bojar, Daniel Hrbek, David Košťák, Martina Kinská, Josef Doležal, Klára Vosecká

    Abstract: We present THEaiTRE, a starting project aimed at automatic generation of theatre play scripts. This paper reviews related work and drafts an approach we intend to follow. We plan to adopt generative neural language models and hierarchical generation approaches, supported by summarization and machine translation methods, and complemented with a human-in-the-loop approach.

    Submitted 25 June, 2020; originally announced June 2020.

    Comments: accepted to AI4Narratives2020

    Journal ref: Proc. AI4Narratives (2020) 9-13

  13. Universal Dependencies according to BERT: both more specific and more general

    Authors: Tomasz Limisiewicz, Rudolf Rosa, David Mareček

    Abstract: This work focuses on analyzing the form and extent of syntactic abstraction captured by BERT by extracting labeled dependency trees from self-attentions. Previous work showed that individual BERT heads tend to encode particular dependency relation types. We extend these findings by explicitly comparing BERT relations to Universal Dependencies (UD) annotations, showing that they often do not matc…

    Submitted 6 October, 2020; v1 submitted 30 April, 2020; originally announced April 2020.

    Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2020

  14. arXiv:1906.11511

    cs.CL

    Inducing Syntactic Trees from BERT Representations

    Authors: Rudolf Rosa, David Mareček

    Abstract: We use the English model of BERT and explore how a deletion of one word in a sentence changes representations of other words. Our hypothesis is that removing a reducible word (e.g. an adjective) does not affect the representation of other words so much as removing e.g. the main verb, which makes the sentence ungrammatical and of "high surprise" for the language model. We estimate reducibilities of…

    Submitted 27 June, 2019; originally announced June 2019.

    Comments: Accepted abstract for the BlackboxNLP 2019

  15. arXiv:1906.02510

    cs.CL

    Derivational Morphological Relations in Word Embeddings

    Authors: Tomáš Musil, Jonáš Vidra, David Mareček

    Abstract: Derivation is a type of word-formation process that creates new words from existing ones by adding, changing or deleting affixes. In this paper, we explore the potential of word embeddings to identify properties of word derivations in the morphologically rich Czech language. We extract derivational relations between pairs of words from DeriNet, a Czech lexical network, which organizes almost on…

    Submitted 6 June, 2019; originally announced June 2019.

    Comments: 8 pages, accepted to BlackBox NLP workshop collocated with ACL 2019 in Florence

  16. arXiv:1906.01958

    cs.CL cs.LG

    From Balustrades to Pierre Vinken: Looking for Syntax in Transformer Self-Attentions

    Authors: David Mareček, Rudolf Rosa

    Abstract: We inspect the multi-head self-attention in Transformer NMT encoders for three source languages, looking for patterns that could have a syntactic interpretation. In many of the attention heads, we frequently find sequences of consecutive states attending to the same position, which resemble syntactic phrases. We propose a transparent deterministic method of quantifying the amount of syntactic info…

    Submitted 5 June, 2019; originally announced June 2019.

    Comments: Accepted at BlackboxNLP 2019

  17. arXiv:1811.04716

    cs.CL

    Input Combination Strategies for Multi-Source Transformer Decoder

    Authors: Jindřich Libovický, Jindřich Helcl, David Mareček

    Abstract: In multi-source sequence-to-sequence tasks, the attention mechanism can be modeled in several ways. This topic has been thoroughly studied on recurrent architectures. In this paper, we extend the previous work to the encoder-decoder attention in the Transformer architecture. We propose four different input combination strategies for the encoder-decoder attention: serial, parallel, flat, and hierar…

    Submitted 12 November, 2018; originally announced November 2018.

    Comments: Published at WMT18