-
Document Understanding Dataset and Evaluation (DUDE)
Authors:
Jordy Van Landeghem,
Rubén Tito,
Łukasz Borchmann,
Michał Pietruszka,
Paweł Józiak,
Rafał Powalski,
Dawid Jurkiewicz,
Mickaël Coustaty,
Bertrand Ackaert,
Ernest Valveny,
Matthew Blaschko,
Sien Moens,
Tomasz Stanisławek
Abstract:
We call on the Document AI (DocAI) community to reevaluate current methodologies and embrace the challenge of creating more practically-oriented benchmarks. Document Understanding Dataset and Evaluation (DUDE) seeks to remediate the halted research progress in understanding visually-rich documents (VRDs). We present a new dataset with novelties related to types of questions, answers, and document layouts based on multi-industry, multi-domain, and multi-page VRDs of various origins and dates. Moreover, we are pushing the boundaries of current methods by creating multi-task and multi-domain evaluation setups that more accurately simulate real-world situations where powerful generalization and adaptation under low-resource settings are desired. DUDE aims to set a new standard as a more practical, long-standing benchmark for the community, and we hope that it will lead to future extensions and contributions that address real-world challenges. Finally, our work illustrates the importance of finding more efficient ways to model language, images, and layout in DocAI.
Submitted 11 September, 2023; v1 submitted 15 May, 2023;
originally announced May 2023.
-
STable: Table Generation Framework for Encoder-Decoder Models
Authors:
Michał Pietruszka,
Michał Turski,
Łukasz Borchmann,
Tomasz Dwojak,
Gabriela Pałka,
Karolina Szyndler,
Dawid Jurkiewicz,
Łukasz Garncarek
Abstract:
The output structure of database-like tables, consisting of values structured in horizontal rows and vertical columns identifiable by name, can cover a wide range of NLP tasks. Following this observation, we propose a framework for text-to-table neural models applicable to problems such as extraction of line items, joint entity and relation extraction, or knowledge base population. The permutation-based decoder of our proposal is a generalized sequential method that comprehends information from all cells in the table. The training maximizes the expected log-likelihood for a table's content across all random permutations of the factorization order. During the content inference, we exploit the model's ability to generate cells in any order by searching over possible orderings to maximize the model's confidence and avoid substantial error accumulation, which other sequential models are prone to. Experiments demonstrate a high practical value of the framework, which establishes state-of-the-art results on several challenging datasets, outperforming previous solutions by up to 15%.
Submitted 12 October, 2022; v1 submitted 8 June, 2022;
originally announced June 2022.
-
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
Authors:
Rafał Powalski,
Łukasz Borchmann,
Dawid Jurkiewicz,
Tomasz Dwojak,
Michał Pietruszka,
Gabriela Pałka
Abstract:
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture, which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions that demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
Submitted 12 July, 2021; v1 submitted 18 February, 2021;
originally announced February 2021.
-
From Dataset Recycling to Multi-Property Extraction and Beyond
Authors:
Tomasz Dwojak,
Michał Pietruszka,
Łukasz Borchmann,
Jakub Chłędowski,
Filip Graliński
Abstract:
This paper investigates various Transformer architectures on the WikiReading Information Extraction and Machine Reading Comprehension dataset. The proposed dual-source model outperforms the current state-of-the-art by a large margin. Next, we introduce WikiReading Recycled, a newly developed public dataset, and the task of multiple property extraction. It uses the same data as WikiReading but does not inherit its predecessor's identified disadvantages. In addition, we provide a human-annotated test set with diagnostic subsets for a detailed analysis of model performance.
Submitted 6 November, 2020;
originally announced November 2020.
-
Successive Halving Top-k Operator
Authors:
Michał Pietruszka,
Łukasz Borchmann,
Filip Graliński
Abstract:
We propose a differentiable successive halving method of relaxing the top-k operator, rendering gradient-based optimization possible. The need to perform softmax iteratively on the entire vector of scores is avoided by using a tournament-style selection. As a result, a much better approximation of top-k with lower computational cost is achieved compared to the previous approach.
Submitted 8 October, 2020;
originally announced October 2020.
-
Sparsifying Transformer Models with Trainable Representation Pooling
Authors:
Michał Pietruszka,
Łukasz Borchmann,
Łukasz Garncarek
Abstract:
We propose a novel method to sparsify attention in the Transformer model by learning to select the most informative token representations during the training process, thus focusing on the task-specific parts of an input. A reduction of quadratic time and memory complexity to sublinear was achieved due to a robust trainable top-$k$ operator. Our experiments on a challenging long document summarization task show that even our simple baseline performs comparably to the current SOTA, and with trainable pooling, we can retain its top quality while being $1.8\times$ faster during training, $4.5\times$ faster during inference, and up to $13\times$ more computationally efficient in the decoder.
Submitted 7 March, 2022; v1 submitted 10 September, 2020;
originally announced September 2020.
-
On the Multi-Property Extraction and Beyond
Authors:
Tomasz Dwojak,
Michał Pietruszka,
Łukasz Borchmann,
Filip Graliński,
Jakub Chłędowski
Abstract:
In this paper, we investigate the Dual-source Transformer architecture on the WikiReading information extraction and machine reading comprehension dataset. The proposed model outperforms the current state-of-the-art by a large margin. Next, we introduce WikiReading Recycled, a newly developed public dataset supporting the task of multiple property extraction. It keeps the spirit of the original WikiReading but does not inherit the identified disadvantages of its predecessor.
Submitted 15 June, 2020;
originally announced June 2020.