Showing 1–7 of 7 results for author: Stanislawek, T

  1. arXiv:2305.08455  [pdf, other]

    cs.CV cs.CL cs.LG

    Document Understanding Dataset and Evaluation (DUDE)

    Authors: Jordy Van Landeghem, Rubén Tito, Łukasz Borchmann, Michał Pietruszka, Paweł Józiak, Rafał Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Ackaert, Ernest Valveny, Matthew Blaschko, Sien Moens, Tomasz Stanisławek

    Abstract: We call on the Document AI (DocAI) community to reevaluate current methodologies and embrace the challenge of creating more practically-oriented benchmarks. Document Understanding Dataset and Evaluation (DUDE) seeks to remediate the halted research progress in understanding visually-rich documents (VRDs). We present a new dataset with novelties related to types of questions, answers, and document…

    Submitted 11 September, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

    Comments: Accepted at ICCV 2023

  2. arXiv:2304.14953  [pdf, other]

    cs.CL

    CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data

    Authors: Michał Turski, Tomasz Stanisławek, Karol Kaczmarek, Paweł Dyda, Filip Graliński

    Abstract: In recent years, the field of document understanding has progressed a lot. A significant part of this progress has been possible thanks to the use of language models pretrained on large amounts of documents. However, pretraining corpora used in the domain of document understanding are single domain, monolingual, or nonpublic. Our goal in this paper is to propose an efficient pipeline for creating…

    Submitted 6 June, 2023; v1 submitted 28 April, 2023; originally announced April 2023.

    Comments: Accepted at ICDAR 2023

  3. Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts

    Authors: Tomasz Stanisławek, Filip Graliński, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek

    Abstract: The relevance of the Key Information Extraction (KIE) task is increasingly important in natural language processing problems. But there are still only a few well-defined problems that serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets (Kleister NDA and Kleister Charity). They involve a mix of scanned and born-digital long formal English-language docum…

    Submitted 12 May, 2021; originally announced May 2021.

    Comments: accepted to ICDAR 2021

    Journal ref: International Conference on Document Analysis and Recognition ICDAR 2021

  4. arXiv:2010.11936  [pdf, other]

    cs.CL cs.AI

    UniCase -- Rethinking Casing in Language Models

    Authors: Rafal Powalski, Tomasz Stanislawek

    Abstract: In this paper, we introduce a new approach to dealing with the problem of case-sensitiveness in Language Modelling (LM). We propose simple architecture modification to the RoBERTa language model, accompanied by a new tokenization strategy, which we named Unified Case LM (UniCase). We tested our solution on the GLUE benchmark, which led to increased performance by 0.42 points. Moreover, we prove th…

    Submitted 22 October, 2020; originally announced October 2020.

  5. arXiv:2003.02356  [pdf, other]

    cs.CL

    Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout

    Authors: Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek

    Abstract: State-of-the-art solutions for Natural Language Processing (NLP) are able to capture a broad range of contexts, like the sentence-level context or document-level context for short documents. But these solutions are still struggling when it comes to longer, real-world documents with the information encoded in the spatial structure of the document, such as page elements like tables, forms, headers,…

    Submitted 6 March, 2020; v1 submitted 4 March, 2020; originally announced March 2020.

  6. LAMBERT: Layout-Aware (Language) Modeling for information extraction

    Authors: Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Michał Turski, Filip Graliński

    Abstract: We introduce a simple new approach to the problem of understanding documents where non-trivial layout influences the local semantics. To this end, we modify the Transformer encoder architecture in a way that allows it to use layout features obtained from an OCR system, without the need to re-learn language semantics from scratch. We only augment the input of the model with the coordinates of token…

    Submitted 28 May, 2021; v1 submitted 19 February, 2020; originally announced February 2020.

    Comments: accepted to ICDAR 2021

    Journal ref: In: Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12821. Springer, Cham

  7. Named Entity Recognition -- Is there a glass ceiling?

    Authors: Tomasz Stanislawek, Anna Wróblewska, Alicja Wójcicka, Daniel Ziembicki, Przemyslaw Biecek

    Abstract: Recent developments in Named Entity Recognition (NER) have resulted in better and better models. However, is there a glass ceiling? Do we know which types of errors are still hard or even impossible to correct? In this paper, we present a detailed analysis of the types of errors in state-of-the-art machine learning (ML) methods. Our study reveals the weak and strong points of the Stanford, CMU, FL…

    Submitted 30 October, 2019; v1 submitted 6 October, 2019; originally announced October 2019.

    Comments: Accepted to CoNLL 2019

    Journal ref: 23rd Conference on Computational Natural Language Learning, CoNLL 2019