-
Document Understanding Dataset and Evaluation (DUDE)
Authors:
Jordy Van Landeghem,
Rubén Tito,
Łukasz Borchmann,
Michał Pietruszka,
Paweł Józiak,
Rafał Powalski,
Dawid Jurkiewicz,
Mickaël Coustaty,
Bertrand Ackaert,
Ernest Valveny,
Matthew Blaschko,
Sien Moens,
Tomasz Stanisławek
Abstract:
We call on the Document AI (DocAI) community to reevaluate current methodologies and embrace the challenge of creating more practically oriented benchmarks. Document Understanding Dataset and Evaluation (DUDE) seeks to remediate the halted research progress in understanding visually-rich documents (VRDs). We present a new dataset with novelties related to types of questions, answers, and document layouts based on multi-industry, multi-domain, and multi-page VRDs of various origins and dates. Moreover, we are pushing the boundaries of current methods by creating multi-task and multi-domain evaluation setups that more accurately simulate real-world situations where powerful generalization and adaptation under low-resource settings are desired. DUDE aims to set a new standard as a more practical, long-standing benchmark for the community, and we hope that it will lead to future extensions and contributions that address real-world challenges. Finally, our work illustrates the importance of finding more efficient ways to model language, images, and layout in DocAI.
Submitted 11 September, 2023; v1 submitted 15 May, 2023;
originally announced May 2023.
-
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
Authors:
Rafał Powalski,
Łukasz Borchmann,
Dawid Jurkiewicz,
Tomasz Dwojak,
Michał Pietruszka,
Gabriela Pałka
Abstract:
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
Submitted 12 July, 2021; v1 submitted 18 February, 2021;
originally announced February 2021.
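The abstract above describes layout as an attention bias. A minimal sketch of that general idea follows, using only the Python standard library. It is not the TILT implementation: the hand-written, distance-based `layout_bias` function and its `scale` parameter are assumptions made for illustration, whereas a trained model would learn its bias terms.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_bias(q, k, v, bias):
    """Single-head attention computed as softmax(q k^T / sqrt(d) + bias) v."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) + bias[i][j]
            for j, kj in enumerate(k)
        ]
        weights = softmax(scores)
        out.append([
            sum(weights[j] * v[j][c] for j in range(len(v)))
            for c in range(len(v[0]))
        ])
    return out


def layout_bias(centres, scale=1.0):
    """Hypothetical bias: penalise token pairs by the Manhattan distance
    between their box centres (x, y). Illustration only, not learned."""
    return [
        [-scale * (abs(xi - xj) + abs(yi - yj)) for (xj, yj) in centres]
        for (xi, yi) in centres
    ]


# Three tokens: the first two sit close together, the third is far away.
centres = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
q = k = [[0.0, 0.0]] * 3          # identical content, so only layout matters
v = [[1.0], [2.0], [3.0]]

zero_bias = [[0.0] * 3 for _ in range(3)]
plain = attention_with_bias(q, k, v, zero_bias)
biased = attention_with_bias(q, k, v, layout_bias(centres))
```

With a zero bias every token attends uniformly; with the layout bias each token attends mostly to its spatial neighbours, even though the token content is identical.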
-
UniCase -- Rethinking Casing in Language Models
Authors:
Rafal Powalski,
Tomasz Stanislawek
Abstract:
In this paper, we introduce a new approach to dealing with the problem of case-sensitiveness in Language Modelling (LM). We propose a simple architecture modification to the RoBERTa language model, accompanied by a new tokenization strategy, which we named Unified Case LM (UniCase). We tested our solution on the GLUE benchmark, which led to increased performance by 0.42 points. Moreover, we prove that the UniCase model works much better when we have to deal with text data where all tokens are uppercased (+5.88 points).
Submitted 22 October, 2020;
originally announced October 2020.
-
LAMBERT: Layout-Aware (Language) Modeling for information extraction
Authors:
Łukasz Garncarek,
Rafał Powalski,
Tomasz Stanisławek,
Bartosz Topolski,
Piotr Halama,
Michał Turski,
Filip Graliński
Abstract:
We introduce a simple new approach to the problem of understanding documents where non-trivial layout influences the local semantics. To this end, we modify the Transformer encoder architecture in a way that allows it to use layout features obtained from an OCR system, without the need to re-learn language semantics from scratch. We only augment the input of the model with the coordinates of token bounding boxes, avoiding, in this way, the use of raw images. This leads to a layout-aware language model which can then be fine-tuned on downstream tasks.
The model is evaluated on an end-to-end information extraction task using four publicly available datasets: Kleister NDA, Kleister Charity, SROIE and CORD. We show that our model achieves superior performance on datasets consisting of visually rich documents, while also outperforming the baseline RoBERTa on documents with flat layout (NDA \(F_{1}\) increase from 78.50 to 80.42). Our solution ranked first on the public leaderboard for the Key Information Extraction from the SROIE dataset, improving the SOTA \(F_{1}\)-score from 97.81 to 98.17.
Submitted 28 May, 2021; v1 submitted 19 February, 2020;
originally announced February 2020.
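The abstract above describes augmenting the model input with token bounding-box coordinates. One way to picture that idea is sketched below in plain Python. This is an assumption-laden illustration, not the LAMBERT code: the helper names `normalize_box` and `add_layout`, the page-size normalisation, and the simple linear `projection` are all hypothetical.

```python
def normalize_box(box, page_width, page_height):
    """Scale an (x0, y0, x1, y1) box from page units to the range [0, 1]."""
    x0, y0, x1, y1 = box
    return [x0 / page_width, y0 / page_height,
            x1 / page_width, y1 / page_height]


def add_layout(token_embedding, box, page_size, projection):
    """Add a linear projection of the normalised box to a token embedding.

    `projection` is a hypothetical (embedding_dim x 4) weight matrix.
    """
    coords = normalize_box(box, *page_size)
    layout = [sum(w * c for w, c in zip(row, coords)) for row in projection]
    return [t + l for t, l in zip(token_embedding, layout)]


embedding = [1.0, 2.0]                 # stand-in for a pretrained embedding
box = (100, 200, 300, 400)             # token bounding box from OCR
page = (1000, 1000)                    # page width and height

# A zero projection leaves the pretrained embedding untouched.
unchanged = add_layout(embedding, box, page, [[0.0] * 4, [0.0] * 4])

# A non-zero projection shifts the embedding by layout information.
shifted = add_layout(embedding, box, page,
                     [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
```

In this sketch a zero-initialised projection reproduces the original embeddings exactly, which is one plausible reading of how a layout-aware model could start from pretrained weights without re-learning language semantics from scratch.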