Showing 1–10 of 10 results for author: Rybak, P

  1. arXiv:2403.04507

    cs.CL

    NLPre: a revised approach towards language-centric benchmarking of Natural Language Preprocessing systems

    Authors: Martyna Wiącek, Piotr Rybak, Łukasz Pszenny, Alina Wróblewska

    Abstract: With the advancements of transformer-based architectures, we observe the rise of natural language preprocessing (NLPre) tools capable of solving preliminary NLP tasks (e.g. tokenisation, part-of-speech tagging, dependency parsing, or morphological analysis) without any external linguistic guidance. It is arduous to compare novel solutions to well-entrenched preprocessing toolkits, relying on rule-…

    Submitted 27 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

    Comments: Accepted at LREC-COLING 2024

  2. arXiv:2402.14408

    cs.CL

    Transferring BERT Capabilities from High-Resource to Low-Resource Languages Using Vocabulary Matching

    Authors: Piotr Rybak

    Abstract: Pre-trained language models have revolutionized the natural language understanding landscape, most notably BERT (Bidirectional Encoder Representations from Transformers). However, a significant challenge remains for low-resource languages, where limited data hinders the effective training of such models. This work presents a novel approach to bridge this gap by transferring BERT capabilities from…

    Submitted 22 February, 2024; originally announced February 2024.

  3. arXiv:2309.08469

    cs.CL cs.IR

    Silver Retriever: Advancing Neural Passage Retrieval for Polish Question Answering

    Authors: Piotr Rybak, Maciej Ogrodniczuk

    Abstract: Modern open-domain question answering systems often rely on accurate and efficient retrieval components to find passages containing the facts necessary to answer the question. Recently, neural retrievers have gained popularity over lexical alternatives due to their superior performance. However, most of the work concerns popular languages such as English or Chinese. For others, such as Polish, few…

    Submitted 22 February, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

  4. arXiv:2305.05486

    cs.CL

    MAUPQA: Massive Automatically-created Polish Question Answering Dataset

    Authors: Piotr Rybak

    Abstract: Recently, open-domain question answering systems have begun to rely heavily on annotated datasets to train neural passage retrievers. However, manually annotating such datasets is both difficult and time-consuming, which limits their availability for less popular languages. In this work, we experiment with several methods for automatically collecting weakly labeled datasets and show how they affec…

    Submitted 9 May, 2023; originally announced May 2023.

  5. arXiv:2305.05474

    cs.CL cs.LG

    Going beyond research datasets: Novel intent discovery in the industry setting

    Authors: Aleksandra Chrabrowa, Tsimur Hadeliya, Dariusz Kajtoch, Robert Mroczkowski, Piotr Rybak

    Abstract: Novel intent discovery automates the process of grouping similar messages (questions) to identify previously unknown intents. However, current research focuses on publicly available datasets which have only the question field and significantly differ from real-life datasets. This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform. We show the be…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted to Findings of EACL 2023

  6. arXiv:2212.08897

    cs.CL

    PolQA: Polish Question Answering Dataset

    Authors: Piotr Rybak, Piotr Przybyła, Maciej Ogrodniczuk

    Abstract: Recently proposed systems for open-domain question answering (OpenQA) require large amounts of training data to achieve state-of-the-art performance. However, data annotation is known to be time-consuming and therefore expensive to acquire. As a result, the appropriate datasets are available only for a handful of languages (mainly English and Chinese). In this work, we introduce and publicly relea…

    Submitted 22 February, 2024; v1 submitted 17 December, 2022; originally announced December 2022.

  7. arXiv:2205.08808

    cs.CL cs.LG

    Evaluation of Transfer Learning for Polish with a Text-to-Text Model

    Authors: Aleksandra Chrabrowa, Łukasz Dragan, Karol Grzegorczyk, Dariusz Kajtoch, Mikołaj Koszowski, Robert Mroczkowski, Piotr Rybak

    Abstract: We introduce a new benchmark for assessing the quality of text-to-text models for Polish. The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering. In particular, since summarization and question answering lack benchmark datasets for the Polish language, we describe their construction and make them publi…

    Submitted 18 May, 2022; originally announced May 2022.

    Comments: Accepted at LREC 2022

  8. arXiv:2105.01735

    cs.CL cs.LG

    HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish

    Authors: Robert Mroczkowski, Piotr Rybak, Alina Wróblewska, Ireneusz Gawlik

    Abstract: BERT-based models are currently used for solving nearly all Natural Language Processing (NLP) tasks and most often achieve state-of-the-art results. Therefore, the NLP community conducts extensive research on understanding these models, but above all on designing effective and efficient training procedures. Several ablation studies investigating how to train BERT-like models have been carried out,…

    Submitted 4 May, 2021; originally announced May 2021.

    Comments: Published in Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

  9. arXiv:2005.00630

    cs.CL

    KLEJ: Comprehensive Benchmark for Polish Language Understanding

    Authors: Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik

    Abstract: In recent years, a series of Transformer-based models unlocked major improvements in general natural language understanding (NLU) tasks. Such a fast pace of research would not be possible without general NLU benchmarks, which allow for a fair comparison of the proposed methods. However, such benchmarks are available only for a handful of languages. To alleviate this issue, we introduce a comprehen…

    Submitted 1 May, 2020; originally announced May 2020.

  10. Semi-Supervised Neural System for Tagging, Parsing and Lematization

    Authors: Piotr Rybak, Alina Wróblewska

    Abstract: This paper describes the ICS PAS system which took part in CoNLL 2018 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. The system consists of jointly trained tagger, lemmatizer, and dependency parser which are based on features extracted by a biLSTM network. The system uses both fully connected and dilated convolutional neural architectures. The novelty of our approach…

    Submitted 26 April, 2020; originally announced April 2020.