Showing 1–14 of 14 results for author: Limisiewicz, T

  1. arXiv:2403.10691  [pdf, other]

    cs.CL cs.AI cs.LG

    MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

    Authors: Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer

    Abstract: A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically m…

    Submitted 15 March, 2024; originally announced March 2024.

  2. arXiv:2401.10440  [pdf, other]

    cs.CL

    Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models

    Authors: Terra Blevins, Tomasz Limisiewicz, Suchin Gururangan, Margaret Li, Hila Gonen, Noah A. Smith, Luke Zettlemoyer

    Abstract: Despite their popularity in non-English NLP, multilingual language models often underperform monolingual ones due to inter-language competition for model parameters. We propose Cross-lingual Expert Language Models (X-ELM), which mitigate this competition by independently training language models on subsets of the multilingual corpus. This process specializes X-ELMs to different languages while rem…

    Submitted 18 January, 2024; originally announced January 2024.

  3. arXiv:2310.18913  [pdf, other]

    cs.CL cs.AI stat.ML

    Debiasing Algorithm through Model Adaptation

    Authors: Tomasz Limisiewicz, David Mareček, Tomáš Musil

    Abstract: Large language models are becoming the go-to solution for the ever-growing number of tasks. However, with growing capacity, models are prone to rely on spurious correlations stemming from biases and stereotypes present in the training data. This work proposes a novel method for detecting and mitigating gender bias in language models. We perform causal analysis to identify problematic model compone…

    Submitted 29 May, 2024; v1 submitted 29 October, 2023; originally announced October 2023.

    Comments: Accepted to ICLR 2024

  4. arXiv:2309.12491  [pdf, other]

    cs.CL cs.AI

    Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation

    Authors: Bar Iluz, Tomasz Limisiewicz, Gabriel Stanovsky, David Mareček

    Abstract: We study the effect of tokenization on gender bias in machine translation, an aspect that has been largely overlooked in previous works. Specifically, we focus on the interactions between the frequency of gendered profession names in training data, their representation in the subword tokenizer's vocabulary, and gender bias. We observe that female and non-stereotypical gender inflections of profess…

    Submitted 30 September, 2023; v1 submitted 21 September, 2023; originally announced September 2023.

    Comments: Accepted to AACL 2023

  5. arXiv:2305.17179  [pdf, other]

    cs.CL

    Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages

    Authors: Tomasz Limisiewicz, Jiří Balhar, David Mareček

    Abstract: Multilingual language models have recently gained attention as a promising solution for representing multiple languages in a single model. In this paper, we propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers. Our findings show that the overlap of vocabulary across languages can be actually detrimental to certain downstream…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: in ACL Findings 2023

  6. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  7. arXiv:2210.07135  [pdf, other]

    cs.CL

    You Can Have Your Data and Balance It Too: Towards Balanced and Efficient Multilingual Models

    Authors: Tomasz Limisiewicz, Dan Malkin, Gabriel Stanovsky

    Abstract: Multilingual models have been widely used for cross-lingual transfer to low-resource languages. However, the performance on these languages is hindered by their underrepresentation in the pretraining data. To alleviate this problem, we propose a novel multilingual training technique based on teacher-student knowledge distillation. In this setting, we utilize monolingual teacher models optimized fo…

    Submitted 26 May, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: SIGTYP 2023

  8. arXiv:2206.10744  [pdf, other]

    cs.CL cs.AI

    Don't Forget About Pronouns: Removing Gender Bias in Language Models Without Losing Factual Gender Information

    Authors: Tomasz Limisiewicz, David Mareček

    Abstract: The representations in large language models contain multiple types of gender information. We focus on two types of such signals in English texts: factual gender information, which is a grammatical or semantic property, and gender bias, which is the correlation between a word and specific gender. We can disentangle the model's embeddings and identify components encoding both types of information w…

    Submitted 21 June, 2022; originally announced June 2022.

    Comments: Presented at GeBNLP 2022

  9. arXiv:2205.04086  [pdf, other]

    cs.CL cs.AI

    A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank

    Authors: Dan Malkin, Tomasz Limisiewicz, Gabriel Stanovsky

    Abstract: We show that the choice of pretraining languages affects downstream cross-lingual transfer for BERT-based models. We inspect zero-shot performance in balanced data conditions to mitigate data size confounds, classifying pretraining languages that improve downstream performance as donors, and languages that are improved in zero-shot performance as recipients. We develop a method of quadratic time c…

    Submitted 9 May, 2022; originally announced May 2022.

    Comments: Accepted to NAACL 2022

  10. arXiv:2109.04921  [pdf, other]

    cs.CL cs.AI

    Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes

    Authors: Tomasz Limisiewicz, David Mareček

    Abstract: State-of-the-art contextual embeddings are obtained from large language models available only for a few languages. For others, we need to learn representations using a multilingual model. There is an ongoing debate on whether multilingual embeddings can be aligned in a space shared across many languages. The novel Orthogonal Structural Probe (Limisiewicz and Mareček, 2021) allows us to answer this…

    Submitted 10 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021 Main Conference

  11. arXiv:2012.15228  [pdf, other]

    cs.CL

    Introducing Orthogonal Constraint in Structural Probes

    Authors: Tomasz Limisiewicz, David Mareček

    Abstract: With the recent success of pre-trained models in NLP, a significant focus was put on interpreting their representations. One of the most prominent approaches is structural probing (Hewitt and Manning, 2019), where a linear projection of word embeddings is performed in order to approximate the topology of dependency structures. In this work, we introduce a new type of structural probing, where the…

    Submitted 23 June, 2021; v1 submitted 30 December, 2020; originally announced December 2020.

  12. arXiv:2010.06018  [pdf, ps, other]

    cs.CL

    Gender Coreference and Bias Evaluation at WMT 2020

    Authors: Tom Kocmi, Tomasz Limisiewicz, Gabriel Stanovsky

    Abstract: Gender bias in machine translation can manifest when choosing gender inflections based on spurious gender correlations. For example, always translating doctors as men and nurses as women. This can be particularly harmful as models become more popular and deployed within commercial systems. Our work presents the largest evidence for the phenomenon in more than 19 systems submitted to the WMT over f…

    Submitted 12 October, 2020; originally announced October 2020.

    Comments: Accepted WMT20

  13. arXiv:2010.01063  [pdf, other]

    cs.CL

    Syntax Representation in Word Embeddings and Neural Networks -- A Survey

    Authors: Tomasz Limisiewicz, David Mareček

    Abstract: Neural networks trained on natural language processing tasks capture syntax even though it is not provided as a supervision signal. This indicates that syntactic analysis is essential to the understanding of language in artificial intelligence systems. This overview paper covers approaches of evaluating the amount of syntactic information included in the representations of words for different neura…

    Submitted 2 October, 2020; originally announced October 2020.

    Journal ref: Proceedings of the 20th Conference ITAT 2020: Automata, Formal and Natural Languages Workshop

  14. Universal Dependencies according to BERT: both more specific and more general

    Authors: Tomasz Limisiewicz, Rudolf Rosa, David Mareček

    Abstract: This work focuses on analyzing the form and extent of syntactic abstraction captured by BERT by extracting labeled dependency trees from self-attentions. Previous work showed that individual BERT heads tend to encode particular dependency relation types. We extend these findings by explicitly comparing BERT relations to Universal Dependencies (UD) annotations, showing that they often do not matc…

    Submitted 6 October, 2020; v1 submitted 30 April, 2020; originally announced April 2020.

    Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2020