HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: arydshln

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: CC BY 4.0
arXiv:2401.13313v1 [cs.CV] 24 Jan 2024

InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

Ryota Tanaka1,2, Taichi Iki1, Kyosuke Nishida1, Kuniko Saito1, Jun Suzuki2
Abstract

We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.

Introduction

Building document artificial intelligence (Document AI) capable of reading and comprehending real-world documents, including webpages, office documents, mobile UIs, etc., has been a long-cherished goal. Toward this goal, numerous works on visual document understanding (VDU) have addressed a wide range of tasks, such as document question answering (QA) (Mathew, Karatzas, and Jawahar 2021) and information extraction (Jaume, Ekenel, and Thiran 2019). Document data contain both textual and visual objects, with content spread structurally across various locations depending on diverse document types and formats. To address this complexity, previous works have proposed models that aim to improve interactions among text/layout/visual modalities (Xu et al. 2021; Appalaraju et al. 2021). However, the diversity of documents and tasks poses a challenge in develo** a unified model that can comprehend intricate relationships between text and visual objects across a wide range of document types, formats, and tasks.

To improve the generalizability and adaptivity of unseen vision-language tasks, visual instruction tuning (Xu, Shen, and Huang 2023; Liu et al. 2023a) has been introduced. This approach involves training multimodal large language models (MLLMs) on a collection of images, task inputs, and instructions. However, according to  (Liu et al. 2023b), most of the previous visual instruction tuning datasets have primarily focused on understanding visual (non-textual) objects in scene images and existing models struggle with accomplishing tasks that require visual document understanding abilities. While recent works (Zhang et al. 2023; Ye et al. 2023a) attempt to deal with the issue, they still exhibit limitations when generalizing to unseen tasks and documents.

In this paper, we propose InstructDoc111Our dataset and codes are publicly available at https://github.com/nttmdlab-nlp/InstructDoc, the first large-scale visual instruction tuning dataset that covers a wide range of VDU tasks and datasets (12 diverse tasks created from 30 openly available datasets). Each dataset has diverse instructions annotated by experts, following a unified instruction schema, composed of user’s intent and answer style, for VDU tasks. As shown in Figure 1, InstructDoc requires a rich set of abilities, including understanding document layout, visual representations of texts, and relation extraction of objects (e.g., graphs and charts) over open document types/formats with handcrafted instructions.

Furthermore, to enhance the generalization performance on VDU tasks, we present a Instruction-based Document reading and understanding model, InstructDr, which unifies the visual, text, and layout modalities of documents by bridging the gap between a vision encoder and a large language model (LLM) through a new bridging module called Document-former. The Document-former converts documents into a useful feature for the LLM. Experiments show that InstructDr achieves the highest zero-shot performance among existing MLLMs and outperforms ChatGPT on a wide range of VDU datasets with instructions.

Refer to caption
Figure 1: Examples of InstructDoc dataset. The input defines intent, query and options, and answer style. query and options and outputs are from original datasets. We annotated instructions composed of intent and answer style or only intent.

Related Work

Visual document understanding.

Visual documents are ubiquitous and used in diverse applications, including QA on business documents (Mathew, Karatzas, and Jawahar 2021), information extraction on receipts (Jaume, Ekenel, and Thiran 2019), and classification over large document collections (Harley, Ufkes, and Derpanis 2015). Due to this diversity, previous works have generally been domain/task-specific, lacking the sharing of underlying data, model architectures, and objectives (Xu et al. 2020; Appalaraju et al. 2021; Huang et al. 2022). Although pixel-based methods (Kim et al. 2022; Lee et al. 2023) can simplify architectures, these methods have high computational costs (due to the encoding of high-resolution images) and can have degraded performance on new tasks. We leverage the reasoning abilities of LLMs and perform all VDU tasks in a unified sequence-to-sequence format with instructions, resulting in improved generalization performance.

Instruction-following language models.

Training LLMs with instructions on various NLP tasks has proven effective in improving zero-shot performance of unseen tasks (Wei et al. 2021; Iyer et al. 2022). Flan (Wei et al. 2021; Longpre et al. 2023), PromptSource (Bach et al. 2022), and Natural Instructions (Mishra et al. 2022) collected instructions and datasets for a variety of general NLP tasks, such as machine reading comprehension and summarization tasks on plain-text documents. In contrast, we tackle the challenge of understanding real-world documents organized in non-plain text formats (e.g., HTML and PDF).

Visual instruction tuning.

Researchers have recently explored the application of LLMs to vision-language tasks by distilling the output of LLMs (Liu et al. 2023a; Zhu et al. 2023; Ye et al. 2023b) or training with handcrafted instructions (Xu, Shen, and Huang 2023; Dai et al. 2023). However, as pointed out in (Liu et al. 2023b), these models struggle with tasks requiring document understanding abilities because they do not assume that text might be contained in images during instruction tuning. To mitigate this issue, LLaVAR (Zhang et al. 2023) and LLMDoc (Ye et al. 2023a) fine-tune MLLMs with instruction tuning on document images. However, these approaches have trouble understanding diverse real-world documents because (i) the datasets provide a few document and task types, hindering zero-shot generalization; and (ii) the models simply encode documents via vision encoders and cannot explicitly learn document meta-information (e.g., document layout). In contrast, the InstructDoc covers diverse VDU tasks and open document types/formats, and InstructDr learns rich representations of the underlying structure of documents with instructions.

InstructDoc Dataset

Refer to caption
Figure 2: Datasets used in InstructDoc. InstructDoc covers a wide range of VDU tasks and document types and formats.

Problem Formulation

All of the tasks in InstructDoc are simply defined as: given an instruction T𝑇Titalic_T and a document image I𝐼Iitalic_I, a model outputs an answer A𝐴Aitalic_A. Each task is composed of one or more datasets, where the dataset 𝒟𝒟\mathcal{D}caligraphic_D is associated with the set of K𝐾Kitalic_K instructions 𝒯𝒟={T1𝒟,,TK𝒟}superscript𝒯𝒟subscriptsuperscript𝑇𝒟1subscriptsuperscript𝑇𝒟𝐾\mathcal{T^{\mathcal{D}}}=\{T^{\mathcal{D}}_{1},...,T^{\mathcal{D}}_{K}\}caligraphic_T start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT = { italic_T start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_T start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT } and contains N𝑁Nitalic_N instances {(𝒯𝒟,Ij,Aj)}j=1Nsubscriptsuperscriptsuperscript𝒯𝒟subscript𝐼𝑗subscript𝐴𝑗𝑁𝑗1\{(\mathcal{T^{\mathcal{D}}},I_{j},A_{j})\}^{N}_{j=1}{ ( caligraphic_T start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT , italic_I start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) } start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT. Here, we randomly select the instruction from 𝒯𝒟superscript𝒯𝒟\mathcal{T^{\mathcal{D}}}caligraphic_T start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT for every instance. Note that we allow the utilization of external OCR engines to derive the answer in our setting, as in the previous VDU benchmark (Borchmann et al. 2021). Our goal is to enable the model to perform a wide range of VDU tasks with instructions rather than improving the accuracy of text recognition (Zhang et al. 2023).

We mainly evaluate the models’ ability to perform zero-shot learning scenarios. Specifically, we fine-tune a model on a collection of instruction tasks and evaluate it on unseen datasets defined three types: (i) TestCross-DatasetCross-Dataset{}_{\text{Cross-Dataset}}start_FLOATSUBSCRIPT Cross-Dataset end_FLOATSUBSCRIPT: datasets not used during training, but whose tasks exist in training set; (ii) TestCross-TaskCross-Task{}_{\text{Cross-Task}}start_FLOATSUBSCRIPT Cross-Task end_FLOATSUBSCRIPT: datasets and associated tasks entirely unseen during training; and (iii) TestCross-DomainCross-Domain{}_{\text{Cross-Domain}}start_FLOATSUBSCRIPT Cross-Domain end_FLOATSUBSCRIPT: datasets, tasks, and document types entirely unseen during training.

Dataset Collection

In this section, we describe the collection process of the InstructDoc dataset. InstructDoc is designed to cover a wide range of VDU tasks with instructions that require reasoning among document layout, images, and text.

Source dataset collection.

Figure 2 shows the source datasets in InstructDoc. We collected 30 publicly available datasets and 12 tasks in VDU areas from DUE (Borchmann et al. 2021) as well as through manual searches. Following the task clusters defined in previous works (Wei et al. 2021; Dai et al. 2023), we divided the QA datasets that require different reasoning abilities into different tasks. As a result, we divided the collected datasets into the following tasks:

  • Key Information Extraction (KIE) assigns each word a semantic entity label from predefined categories (Šimsa et al. 2023; Jaume, Ekenel, and Thiran 2019; Sun et al. 2021; Park et al. 2019; Huang et al. 2019).

  • Single-page QA is a task of QA on single-page documents and focuses on document layout and textual content understanding (Tanaka, Nishida, and Yoshida 2021; Chen et al. 2021; Mishra et al. 2019; Mathew, Karatzas, and Jawahar 2021; Tüselmann et al. 2022).

  • Single-page QA w/ Discrete Reasoning requires various arithmetic abilities, including addition, sorting, or counting (Zhu et al. 2022).

  • Single-page QA w/ Visual Reasoning requires a set of abilities, including object (e.g., icon) recognition, commonsense understanding, and relation extraction on single-page documents (Lu et al. 2021; Kembhavi et al. 2016; Lu et al. 2022; Kembhavi et al. 2016).

  • Single-page QA w/ Discrete & Visual Reasoning requires both discrete and visual reasoning (Mathew et al. 2022; Masry et al. 2022) on single-page documents.

  • Multi-page QA w/ Multi-hop & Discrete & Visual Reasoning requires understanding the content relationship via multi-hop reasoning as well as discrete/visual reasoning on multi-page documents (Tanaka et al. 2023; Landeghem et al. 2023).

  • Document NLI is a task of natural language inference that predicts the entailment relationship between two sentences in a document (Borchmann et al. 2021)

  • Dialogue involves a human-agent interaction on the basis of document images (Zhang et al. 2023).

  • Captioning involves producing descriptions of documents (Hsu, Giles, and Huang 2021; Wang et al. 2021).

  • Classification involves classifying a document from a set of candidate labels (Harley, Ufkes, and Derpanis 2015).

  • Document Layout Analysis (DLA) determines a document’s components with bounding boxes (Li et al. 2020; Pfitzmann et al. 2022)

  • Image-Text Matching (ITM) requires the model to determine whether a given OCR text and image match.

Refer to caption
Figure 3: InstructDr model. We update only the parameters of Document-former and the projection FFN layer during training.

Query rephrasing.

We found that two KIE datasets (FUNSD and CORD) are challenging because they contain abbreviated queries that are difficult for humans to comprehend. To bridge the gap between humans and machines, we replace these queries with complete and more easily understandable phrases (e.g., menu.vatyn \to menu_whether_price_tax_included).

Instruction annotation.

For each dataset, we manually crafted five to ten distinct instruction templates in a unified format. For QA tasks, the answers have diverse styles in the original datasets; for example, DocVQA’s answer is extractive, which requires the model to extract a contiguous span of words from the document, but VisualMRC’s answer is generative, which is not limited to the word spans. Hence, an instruction that sufficiently describes an arbitrary VDU task should include intent and answer style or only intent. Specifically, as shown in Figure 1, intent describes how the task can be performed and answer style describes how the model generates the output. If each dataset provides query and options, we fill it in annotated instruction templates.

Data split.

We split InstructDoc into 23 held-in and seven held-out datasets. For the held-out evaluation, we aim to understand how instruction tuning on the held-in datasets improves the zero-shot generalization performance on unseen datasets, including (i) TestCross-DatasetCross-Dataset{}_{\text{Cross-Dataset}}start_FLOATSUBSCRIPT Cross-Dataset end_FLOATSUBSCRIPT: FUNSD and CORD datasets, (ii) TestCross-TaskCross-Task{}_{\text{Cross-Task}}start_FLOATSUBSCRIPT Cross-Task end_FLOATSUBSCRIPT: ChartQA, InfoVQA, and TabFact datasets, and (iii) TestCross-DomainCross-Domain{}_{\text{Cross-Domain}}start_FLOATSUBSCRIPT Cross-Domain end_FLOATSUBSCRIPT: DUDE and SlideVQA datasets. All other datasets were held-in ones to train our model. Note that the held-out datasets were carefully selected in order to avoid data contamination.

LLaVAR DocOwl InstructDoc
Both Single/Multi-page docs
Instruction annotation
Answer style annotation
#Document types 8 7 Open
#Seed datasets 1 8 30
#Task clusters 1 3 12
#Avg.±Std.plus-or-minusStd.{}_{\pm\text{Std.}}start_FLOATSUBSCRIPT ± Std. end_FLOATSUBSCRIPT IT words - 5±0plus-or-minus0{}_{\pm 0}start_FLOATSUBSCRIPT ± 0 end_FLOATSUBSCRIPT 20.3±11.2plus-or-minus11.2{}_{\pm 11.2}start_FLOATSUBSCRIPT ± 11.2 end_FLOATSUBSCRIPT
#Avg.±Std.plus-or-minusStd.{}_{\pm\text{Std.}}start_FLOATSUBSCRIPT ± Std. end_FLOATSUBSCRIPT IT - 1±0plus-or-minus0{}_{\pm 0}start_FLOATSUBSCRIPT ± 0 end_FLOATSUBSCRIPT 7.4±2.4plus-or-minus2.4{}_{\pm 2.4}start_FLOATSUBSCRIPT ± 2.4 end_FLOATSUBSCRIPT
#Avg.±Std.plus-or-minusStd.{}_{\pm\text{Std.}}start_FLOATSUBSCRIPT ± Std. end_FLOATSUBSCRIPT OCR words 52.5±93.1plus-or-minus93.1{}_{\pm 93.1}start_FLOATSUBSCRIPT ± 93.1 end_FLOATSUBSCRIPT 270.1±807.2plus-or-minus807.2{}_{\pm 807.2}start_FLOATSUBSCRIPT ± 807.2 end_FLOATSUBSCRIPT 443.2±1442.8plus-or-minus1442.8{}_{\pm 1442.8}start_FLOATSUBSCRIPT ± 1442.8 end_FLOATSUBSCRIPT
#Avg.±Std.plus-or-minusStd.{}_{\pm\text{Std.}}start_FLOATSUBSCRIPT ± Std. end_FLOATSUBSCRIPT Answer words 34.5±27.5plus-or-minus27.5{}_{\pm 27.5}start_FLOATSUBSCRIPT ± 27.5 end_FLOATSUBSCRIPT 1.9±2.7plus-or-minus2.7{}_{\pm 2.7}start_FLOATSUBSCRIPT ± 2.7 end_FLOATSUBSCRIPT 5.88±17.7plus-or-minus17.7{}_{\pm 17.7}start_FLOATSUBSCRIPT ± 17.7 end_FLOATSUBSCRIPT
Table 1: Statistics of InstructDoc and other VDU instruction tuning datasets. We excluded data other than the VDU tasks from DocOwl. IT denotes instruction templates.

Comparison with Related Datasets

Table 1 shows the statistics of InstructDoc and other VDU instruction tuning datasets, including LLaVAR (Zhang et al. 2023) and DocOwl (Ye et al. 2023a). InstructDoc has three unique key properties; First, it is the first dataset to address open document types, including multi-page documents and has the highest standard deviation in the number of OCR tokens (1442.8) compared with LLaVAR (93.1) and DocOwl (807.2). This implies that our dataset is a more challenging setting. Second, InstructDoc covers the widest range of tasks, offering four times more tasks compared with DocOwl, while LLaVAR provides only a single task. Finally, InstructDoc provides a more extensive set of instructions (20.3 words and 7.4 templates) and annotates various answer styles within the instructions to deal with various VDU tasks that require diverse abilities. In contrast, the instructions in DocOwl are limited (five words and a single template) and LLaVAR has only machine-generated instructions, and they may not generalize well to reformulations and new tasks.

Our Model

Figure 3 depicts our model, InstructDr (Instruction-based Dument reading and understanding model). We use pre-trained BLIP-2 (Li et al. 2023), a state-of-the-art MLLM connected with instruction-tuned FlanT5 (Chung et al. 2022), as the base model. We extend BLIP-2 in three key ways; (i) equip** it with Document-former, an enhanced Q-former module that can capture and convert the visual and textual content/layout of documents into representations of the LLM, (ii) conducting multi-task instruction tuning with unified formats, and (iii) encoding multiple images in parallel to facilitate understanding of multi-page documents.

Cross-Dataset Cross-Task Cross-Domain
Model Modal #TuP #ToP FUNSD CORD ChartQA InfoVQA TabFact DUDE SlideVQA Held-out
eF1/F1 eF1/F1 RAcc./F1 ANLS/F1 Acc./F1 ANLS/F1 EM/F1 Avg.
LLMDoc V 388M 7B -/- -/- -/- 38.2†/- 60.2†/- -/- -/- -/-
LLaVA TV 13B 13B 12.0/1.3 0.2/ 5.1 0.0/1.7 3.4/3.5 0.0/0.0 6.5/5.9 0.0/2.3 3.1/2.8
LLaVAR TV 13B 13B 12.0/2.0 0.1/10.8 0.0/3.0 6.2/4.6 0.0/2.1 8.1/5.1 0.0/6.2 3.8/4.8
MiniGPT-4 TV 3.1M 7B 12.0/2.2 0.2/ 2.1 0.0/0.4 4.3/0.5 0.3/0.2 5.9/1.1 0.0/0.4 3.2/1.0
mPLUG-Owl TV 388M 7B 12.0/6.7 0.2/15.0 0.0/0.3 5.6/5.3 0.0/2.6 5.8/5.5 0.0/0.4 3.4/5.1
InstructBLIP TV 103M 3.4B 16.8/15.0 4.9/9.5 3.3/7.2 8.7/7.3 33.6/33.7 11.0/8.8 5.2/9.0 11.9/12.9
BLIP-2 TV 103M 3.4B 19.6/19.6 32.0/51.9 23.6/21.5 48.2/36.7 58.6/58.6 39.8/35.4 28.3/38.8 35.7/37.5
BLIP-2 trained on IDoc TV 103M 3.4B 26.0/26.1 33.8/54.7 24.7/21.2 47.8/35.4 58.8/58.8 43.9/40.4 30.1/38.8 37.9/39.3
InstructDr (Ours) TLV 103.1M 3.4B 38.2/38.1 46.0/62.7 29.4/22.3 50.9/37.6 59.4/59.4 45.2/41.6 31.9/40.2 43.0/43.1
Table 2: Zero-shot performance of InstructDr and MLLMs on VDU tasks. “T/L/V” denotes the “text/layout/visual” modality of documents. #TuP/#ToP denotes the number of tuning/total parameters. The highest zero-shot performances are marked in bold. †denotes the supervised performance reported in the original paper, as it is not publicly available. IDoc denotes InstructDoc.

Spatial-aware Document Feature Extraction

Document image/OCR and instruction encoding.

To encode a document image, we use a pre-trained CLIP (Radford et al. 2021) vision encoder to extract its visual features 𝐳vissuperscript𝐳vis\mathbf{z}^{\text{vis}}bold_z start_POSTSUPERSCRIPT vis end_POSTSUPERSCRIPT. Additionally, we process the document image using an OCR engine and apply a sub-word tokenizer to obtain M𝑀Mitalic_M word tokens {si}i=1Msuperscriptsubscriptsubscript𝑠𝑖𝑖1𝑀\{s_{i}\}_{i=1}^{M}{ italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT and their corresponding bounding boxes {(xi1,yi1,xi2,yi2)}i=1Msuperscriptsubscriptsuperscriptsubscript𝑥𝑖1superscriptsubscript𝑦𝑖1superscriptsubscript𝑥𝑖2superscriptsubscript𝑦𝑖2𝑖1𝑀\{(x_{i}^{1},y_{i}^{1},x_{i}^{2},y_{i}^{2})\}_{i=1}^{M}{ ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT, where (x1superscript𝑥1x^{1}italic_x start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT, y1superscript𝑦1y^{1}italic_y start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT) and (x2superscript𝑥2x^{2}italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, y2superscript𝑦2y^{2}italic_y start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT) represent the coordinates of the top-left and bottom-right corners, respectively. To learn the visual layout of the image, we construct a spatially aware OCR representation 𝐳iocr=𝐳iword+𝐳ibboxsuperscriptsubscript𝐳𝑖ocrsuperscriptsubscript𝐳𝑖wordsuperscriptsubscript𝐳𝑖bbox\mathbf{z}_{i}^{\text{ocr}}=\mathbf{z}_{i}^{\text{word}}+\mathbf{z}_{i}^{\text% {bbox}}bold_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ocr end_POSTSUPERSCRIPT = bold_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT word end_POSTSUPERSCRIPT + bold_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bbox end_POSTSUPERSCRIPT with learnable embedding layers 𝐖{s,x,y,h,w}superscript𝐖𝑠𝑥𝑦𝑤\mathbf{W}^{\{s,x,y,h,w\}}bold_W start_POSTSUPERSCRIPT { italic_s , italic_x , italic_y , italic_h , italic_w } end_POSTSUPERSCRIPT, where OCR text features are calculated as 𝐳iword=𝐖s(si)subscriptsuperscript𝐳word𝑖superscript𝐖𝑠subscript𝑠𝑖\mathbf{z}^{\text{word}}_{i}=\mathbf{W}^{s}(s_{i})bold_z start_POSTSUPERSCRIPT word end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = bold_W start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) and spatial features are calculated as 𝐳ibbox=𝐖x(xi1,xi2)+𝐖y(yi1,yi2)+𝐖h(yi2yi1)+𝐖w(xi2xi1)subscriptsuperscript𝐳bbox𝑖superscript𝐖𝑥subscriptsuperscript𝑥1𝑖subscriptsuperscript𝑥2𝑖superscript𝐖𝑦subscriptsuperscript𝑦1𝑖subscriptsuperscript𝑦2𝑖superscript𝐖subscriptsuperscript𝑦2𝑖subscriptsuperscript𝑦1𝑖superscript𝐖𝑤subscriptsuperscript𝑥2𝑖subscriptsuperscript𝑥1𝑖\mathbf{z}^{\text{bbox}}_{i}=\mathbf{W}^{x}(x^{1}_{i},x^{2}_{i})+\mathbf{W}^{y% }(y^{1}_{i},y^{2}_{i})+\mathbf{W}^{h}(y^{2}_{i}-y^{1}_{i})+\mathbf{W}^{w}(x^{2% }_{i}-x^{1}_{i})bold_z start_POSTSUPERSCRIPT bbox end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = bold_W start_POSTSUPERSCRIPT italic_x end_POSTSUPERSCRIPT ( italic_x start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + bold_W start_POSTSUPERSCRIPT italic_y end_POSTSUPERSCRIPT ( italic_y start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + bold_W start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( italic_y start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_y start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + bold_W start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_x start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). Similarly, we encode an instruction by 𝐖ssuperscript𝐖𝑠\mathbf{W}^{s}bold_W start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT and obtain its features 𝐳inssuperscript𝐳ins\mathbf{z}^{\text{ins}}bold_z start_POSTSUPERSCRIPT ins end_POSTSUPERSCRIPT.

Document-former.

We introduce Document-former, which is a trainable module to bridge the gap between an image encoder and an LLM, enabling extraction of document content/layout that LLMs can understand. The architecture of Document-former is a stack of Transformer blocks with cross-attention layers. To map document features into the LLM’s space, we use a set of m𝑚mitalic_m learnable tokens 𝐳tokendsuperscript𝐳tokensuperscript𝑑\mathbf{z}^{\text{token}}\in\mathbb{R}^{d}bold_z start_POSTSUPERSCRIPT token end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, where d𝑑ditalic_d is the dimension of the hidden size. These tokens 𝐳tokensuperscript𝐳token\mathbf{z}^{\text{token}}bold_z start_POSTSUPERSCRIPT token end_POSTSUPERSCRIPT interact with 𝐳vissuperscript𝐳vis\mathbf{z}^{\text{vis}}bold_z start_POSTSUPERSCRIPT vis end_POSTSUPERSCRIPT through cross-attention layers and with the input sequence, composed of 𝐳inssuperscript𝐳ins\mathbf{z}^{\text{ins}}bold_z start_POSTSUPERSCRIPT ins end_POSTSUPERSCRIPT and 𝐳ocrsuperscript𝐳ocr\mathbf{z}^{\text{ocr}}bold_z start_POSTSUPERSCRIPT ocr end_POSTSUPERSCRIPT, through self-attention layers. As a result, we obtain 𝐳docsuperscript𝐳doc\mathbf{z}^{\text{doc}}bold_z start_POSTSUPERSCRIPT doc end_POSTSUPERSCRIPT and transform it via a projection feed-forward network (FFN) layer to 𝐡docm×dLLMsuperscript𝐡docsuperscript𝑚superscript𝑑LLM\mathbf{h}^{\text{doc}}\in\mathbb{R}^{m\times d^{\text{LLM}}}bold_h start_POSTSUPERSCRIPT doc end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_d start_POSTSUPERSCRIPT LLM end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT, which have the same dimension dLLMsuperscript𝑑LLMd^{\text{LLM}}italic_d start_POSTSUPERSCRIPT LLM end_POSTSUPERSCRIPT as the LLM’s input embedding.

Multimodal Document Large Language Model

Connecting document features to LLM.

The LLM receives the document embeddings 𝐡docsuperscript𝐡doc\mathbf{h}^{\text{doc}}bold_h start_POSTSUPERSCRIPT doc end_POSTSUPERSCRIPT, the instruction, and OCR tokens as input and outputs the answer 𝐀𝐀\mathbf{A}bold_A, token by token. The parameters of the LLM are initialized from an instruction-tuned FlanT5.

Parameter-efficient multi-task instruction tuning.

To achieve task-agnostic learning, we formulate the process of learning all held-in tasks in a unified sequence-to-sequence abstraction through instructions. To train the LLM efficiently, we update only the parameters of the Document-former (including 𝐖{s,x,y,h,w}superscript𝐖𝑠𝑥𝑦𝑤\mathbf{W}^{\{s,x,y,h,w\}}bold_W start_POSTSUPERSCRIPT { italic_s , italic_x , italic_y , italic_h , italic_w } end_POSTSUPERSCRIPT) and the projection FFN layer, while kee** other parameters frozen during training. We optimize the model by minimizing the negative log-likelihood between the ground-truth and predictions.

Multi-page document understanding.

We also support performing reasoning across multiple document pages. As shown in Figure 3b, each image is processed individually by the image encoder and Document-former, and their resulting document embeddings are mean-pooled together before being fed into the LLM. The OCR input to the LLM consists of concatenated tokens extracted from each page.

Experiments

Cross-Dataset Cross-Task Cross-Domain
Model Modal FUNSD CORD ChartQA InfoVQA TabFact DUDE SlideVQA Held-out
eF1/F1 eF1/F1 RAcc./F1 ANLS/F1 Acc./F1 ANLS/F1 EM/F1 Avg.
Supervised SOTA models TLV 92.1/- 97.7/- 72.3/- 54.8*/- 83.2*/- 46.1*/- 33.5/41.7 -/-
ChatGPT T 21.8/21.2 30.4/49.3 16.0/16.8 37.8/29.5 52.5/52.4 34.5/32.3 11.7/23.8 29.2/32.2
GPT-4 T 47.5/47.5 69.4/81.7 20.9/27.6 49.9/46.5 68.8/68.8 46.3/45.1 21.0/36.4 46.3/50.5
InstructDr (Ours) TLV 38.2/38.1 46.0/62.7 29.4/22.3 50.9/37.6 59.4/59.4 45.2/ 41.6 31.9/40.2 43.0/43.1
Table 3: Zero-shot performance on VDU tasks of InstructDr and supervised SOTA models and powerful text-based LLMs. * denotes the performance on different splits we used because they evaluated on the leaderboard and F1 cannot be used.

Experimental Setup

We mainly conducted evaluations under three zero-shot settings, including TestCross-DatasetCross-Dataset{}_{\text{Cross-Dataset}}start_FLOATSUBSCRIPT Cross-Dataset end_FLOATSUBSCRIPT, TestCross-TaskCross-Task{}_{\text{Cross-Task}}start_FLOATSUBSCRIPT Cross-Task end_FLOATSUBSCRIPT, and TestCross-DomainCross-Domain{}_{\text{Cross-Domain}}start_FLOATSUBSCRIPT Cross-Domain end_FLOATSUBSCRIPT. Furthermore, we evaluated our model under the task-specific fine-tuning setting.

Baselines.

We compared InstructDr with seven state-of-the-art (SOTA) MLLMs, including LLaVA (Liu et al. 2023a), MiniGPT-4 (Zhu et al. 2023) and mPLUG-Owl (Ye et al. 2023b), which align CLIP visual encoder with Vicuna (Chiang et al. 2023) trained on a dialogue generated by GPT-4 (OpenAI 2023); BLIP-2 (Li et al. 2023), which connects a FlanT5 with a vision encoder; InstructBLIP (Dai et al. 2023), which fine-tunes BLIP-2 with instructions on scene images; and LLMDoc (Ye et al. 2023a) and LLaVAR (Zhang et al. 2023), which fine-tune mPULG-Owl/LLaVA on the DocOwl/LLaVAR datasets. Additionally, we used Supervised SOTA models (Appalaraju et al. 2023; Chen et al. 2023; Huang et al. 2022; Landeghem et al. 2023) on each dataset and two text-based LLMs, ChatGPT (gpt-3.5-turbo-0613) and GPT-4. To control the answer’s length, we added control phrases (e.g., use 1 to 3 words to answer) to the selected instructions.

Evaluation metrics.

We followed the evaluation protocol of each dataset, we used ANLS (Biten et al. 2019) for InfoVQA, DUDE, Text-VQA and ST-VQA, EM for SlideVQA, Relaxed Accuracy (RAcc.) for ChartQA, entity F1 (eF1) for FUNSD and CORD, Accuracy (Acc.) for TabFact, and ROUGE-L for VisualMRC as evaluation metrics. Additionally, we used F1 as the optional metrics.

CORD TabFact DUDE Held-out
Model eF1 Acc. ANLS Avg.
InstructDr 46.0 59.4 45.2 43.0
w/o Document-former 38.538.538.538.5 58.858.858.858.8 44.644.644.644.6 40.240.240.240.2
w/o Spatially OCR features 33.833.833.833.8 58.858.858.858.8 43.943.943.943.9 37.937.937.937.9
w/o Mean pooling (concat.) - - 43.843.843.843.8 -
w/o Instructions in test 24.024.024.024.0 4.04.04.04.0 38.938.938.938.9 28.028.028.028.0
w/o Instructions in train 17.317.317.317.3 58.258.258.258.2 34.034.034.034.0 28.928.928.928.9
w/o Instructions in both 0.40.40.40.4 3.73.73.73.7 24.424.424.424.4 21.321.321.321.3
w/o Query rephrasing 30.930.930.930.9 - - -
w/o Answer style annotation - - 44.244.244.244.2 -
Table 4: Ablation study of the architecture and instructions. We report the scores when the ablation can be conducted.

Implementation details.

Following (Wei et al. 2021), we balanced the training instances of different tasks by sampling a maximum of 5k instances for each held-in dataset while kee** all evaluation instances. We used the AdamW (Loshchilov and Hutter 2017) with a weight decay of 0.05. We applied a linear warmup during the initial 1,000 steps and used a cosine learning rate decay with a minimum learning rate of 0. We set the number of learnable tokens m𝑚mitalic_m to 32323232. All images of the model input were resized to 224224224224. We trained on eight A100 (40G) GPUs for three epochs and completed the training within two hours. If each dataset does not provide OCR, we extracted it via the Google Vision API.

Experimental Results and Analysis

Does our model outperform existing MLLMs?

Table 2 shows that our model achieved the highest performance on all datasets compared with other MLLMs. InstructDr consistently outperformed its original backbone, BLIP-2, by a significant margin, indicating that instruction tuning on InstructDoc effectively enhances performance on unseen VDU datasets, tasks, and domains. In contrast, InstructBLIP, which is instruction-tuned BLIP-2 trained on scene images, performed worse than BLIP-2. This is because that InstructBLIP does not assume that the images might contain text during instruction tuning. BLIP-2 fine-tuned on InstructDoc falls short of achieving the same level of performance compared with InstructDr, indicating that InstructDr is better suited for comprehending diverse real-world documents. This conclusion is further supported by the results presented in Table 4, where ablations of Document-former, spatial information, and strategy of gathering multi-page features have a significant negative impact on performance.

How well does our model perform in comparison with supervised SOTA models and powerful LLMs?

As shown in Table 3, our model outperformed ChatGPT on all datasets. Additionally, InstructDr achieved competitive results with supervised SOTA models and GPT-4 on the DUDE and SlideVQA datasets that require multiple reasoning skills (e.g., discrete, visual, and multi-hop reasoning). This indicates that our model can effectively learn diverse skills through instruction tuning with InstructDoc.

Refer to caption
Figure 4: Comparison of zero-shot performance on DUDE for five different instructions. w/o Multiple instructions denotes our model trained with a single instruction per dataset.
Refer to caption
Figure 5: Model performance as the number of task clusters used in training. (\cdot) denotes the number of tasks.

What is the role of instructions?

As shown in Table 4, removing instructions (i.e., only query and options as the model input) significantly decreased zero-shot performance during training or/and test time, indicating the effectiveness of incorporating instructions. This result was observed on the high-quality instruction-tuning datasets (Wei et al. 2021; Xu, Shen, and Huang 2023). Moreover, our instruction annotations, including query rephrasing and answer styles, helped to improve the zero-shot performance.

Refer to caption
Figure 6: Qualitative examples. Outputs are correct/sufficient and incorrect/insufficient answers. (…) denotes ellipsis.

Does our model have robustness towards diverse instructions?

Figure 4 shows the performance variance when the models were given five different instructions; InstructDr exhibited the smallest performance variance and outperformed the other models. This indicates InstructDoc empowers the model with the ability to deal with a variety of instructions. Our results also suggest that using multiple instructions per dataset is important for achieving decent performance.

What is the impact of diverse task clusters?

As shown in Figure 5, as the number of task clusters increases, we can observe an improvement in models’ zero-shot performance.

VisualMRC DUDE SlideVQA
Model ROUGE-L ANLS EM F1
Supervised SOTA models 52.2 46.1 33.5 41.7
BLIP-2 60.5 45.6 36.9 46.5
InstructDr 61.1 46.8 37.7 47.3
Table 5: Fine-tuning performance in held-in (VisualMRC) and held-out (DUDE, SlideVQA) tasks on the test set.
Image type in TextVQA ST-VQA
Model instruction tuning Acc. ANLS ANLS
BLIP-2 - 48.7 64.8 39.1
InstructBLIP Daily scene 52.8 67.3 45.7
InstructDr Documents 53.8 68.1 43.3
Table 6: Zero-shot performance of scene-text VQA task.

Are our model weights effective for task-specific fine-tuning?

We further fine-tuned InstructDr (only Document-former module) on a specific dataset to investigate the knowledge and transferability of our instruction-tuned model weights. Table 5 shows the fine-tuning performance on held-in (VisualMRC) and held-out (DUDE, SlideVQA) tasks. InstructDr achieved state-of-the-art finetuning performance on VisualMRC, DUDE, and SlideVQA using a unified model. Compared with BLIP-2, InstructDr exhibited superior fine-tuning performance on both held-in/out datasets, validating InstructDr as a better weight initialization model for task-specific fine-tuning.

Can our model also understand images other than documents?

Table 6 shows the zero-shot performance of scene-text VQA (Singh et al. 2019; Biten et al. 2019) on scene images, which are the unseen image types in InstructDoc but were used for training our base model, BLIP-2. Note that ST-VQA’s images include the part of COCO (Lin et al. 2014) that InstructBLIP was trained on. This result indicates that InstructDr can effectively learn visual reasoning skills without forgetting the abilities of the original backbone.

Qualitative examples.

Figure 6 visualizes output examples, where the left/center/right examples require table/visual/hand-written text understanding skills. ChatGPT gave incorrect answers because it can only consider text information. Moreover, while BLIP-2 could not follow instructions (e.g., use 5 to 10 words) and extract items from structured text, InstructDr accomplished diverse VDU tasks with instructions. As shown in the right example, all models affected OCR quality, causing incorrect answers.

Limitations

Despite its impressive performance on various VDU tasks with instructions, InstructDr suffers from noisy OCR predictions, whose performance depends highly on OCR text qualities (right of Figure 6). We argue that our approach is more cost-efficient and accurate because another approach, the pixel-based ones (Kim et al. 2022; Chen et al. 2023), requires a large amount of computation to encode high-resolution images and cannot use document meta-information (e.g., bounding boxes). Moreover, since InstructDoc only contains a single document-text pair per instance, it cannot learn the correlation among multiple document-text pairs and lacks an in-context learning capability. The same observation has also been reported in the Flamingo (Alayrac et al. 2022) and BLIP-2. Finally, while we have constructed diverse VDU tasks, the number of tasks and corresponding instructions are still limited. We plan to consider utilizing automatic generation and augmentation techniques to increase the variety of instructions available.

Conclusion

We introduced a new large-scale instruction-tuning dataset, InstructDoc, to lay the foundation for building general-purpose VDU models that can follow natural language instructions. We also introduced a simple yet effective instruction tuning model, InstructDr, which unifies the vision, text, and layout modalities of documents by bridging the gap between a vision encoder and an LLM with Document-former. We performed a comprehensive study on instruction tuning with InstructDoc and demonstrated its generalization capability to a wide range of VDU datasets, tasks, and domains with instructions. We believe that our dataset will facilitate research on develo** general-purpose document artificial intelligence systems.

References

  • Alayrac et al. (2022) Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a Visual Language Model for Few-Shot Learning. In NeurIPS.
  • Appalaraju et al. (2021) Appalaraju, S.; Jasani, B.; Kota, B. U.; Xie, Y.; and Manmatha, R. 2021. Docformer: End-to-End Transformer for Document Understanding. In CVPR, 993–1003.
  • Appalaraju et al. (2023) Appalaraju, S.; Tang, P.; Dong, Q.; Sankaran, N.; Zhou, Y.; and Manmatha, R. 2023. DocFormerv2: Local Features for Document Understanding. arXiv:2306.01733.
  • Bach et al. (2022) Bach, S.; Sanh, V.; Yong, Z. X.; Webson, A.; Raffel, C.; Nayak, N. V.; Sharma, A.; Kim, T.; Bari, M. S.; Fevry, T.; Alyafeai, Z.; Dey, M.; Santilli, A.; Sun, Z.; Ben-david, S.; Xu, C.; Chhablani, G.; Wang, H.; Fries, J.; Al-shaibani, M.; Sharma, S.; Thakker, U.; Almubarak, K.; Tang, X.; Radev, D.; Jiang, M. T.-j.; and Rush, A. 2022. PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts. In ACL-demo, 93–104.
  • Biten et al. (2019) Biten, A. F.; Tito, R.; Mafla, A.; i Bigorda, L. G.; Rusiñol, M.; Jawahar, C. V.; Valveny, E.; and Karatzas, D. 2019. Scene Text Visual Question Answering. In ICCV, 4290–4300.
  • Borchmann et al. (2021) Borchmann, Ł.; Pietruszka, M.; Stanislawek, T.; Jurkiewicz, D.; Turski, M.; Szyndler, K.; and Graliński, F. 2021. DUE: End-to-End Document Understanding Benchmark. In NeurIPS.
  • Chen et al. (2023) Chen, X.; Djolonga, J.; Padlewski, P.; Mustafa, B.; Changpinyo, S.; Wu, J.; Ruiz, C. R.; Goodman, S.; Wang, X.; Tay, Y.; et al. 2023. PaLI-X: On Scaling up a Multilingual Vision and Language Model. arXiv:2305.18565.
  • Chen et al. (2021) Chen, X.; Zhao, Z.; Chen, L.; Ji, J.; Zhang, D.; Luo, A.; Xiong, Y.; and Yu, K. 2021. WebSRC: A Dataset for Web-Based Structural Reading Comprehension. In EMNLP, 4173–4185.
  • Chiang et al. (2023) Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J. E.; Stoica, I.; and Xing, E. P. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
  • Chung et al. (2022) Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S.; et al. 2022. Scaling Instruction-Finetuned Language Models. arXiv:2210.11416.
  • Dai et al. (2023) Dai, W.; Li, J.; Li, D.; Tiong, A. M. H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; and Hoi, S. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv:2305.06500.
  • Harley, Ufkes, and Derpanis (2015) Harley, A. W.; Ufkes, A.; and Derpanis, K. G. 2015. Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval. In ICDAR, 991–995.
  • Hsu, Giles, and Huang (2021) Hsu, T.-Y.; Giles, C. L.; and Huang, T.-H. 2021. SciCap: Generating Captions for Scientific Figures. In EMNLP Findings, 3258–3264.
  • Huang et al. (2022) Huang, Y.; Lv, T.; Cui, L.; Lu, Y.; and Wei, F. 2022. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. In ACMM, 4083–4091.
  • Huang et al. (2019) Huang, Z.; Chen, K.; He, J.; Bai, X.; Karatzas, D.; Lu, S.; and Jawahar, C. 2019. ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction. In ICDAR, 1516–1520.
  • Iyer et al. (2022) Iyer, S.; Lin, X. V.; Pasunuru, R.; Mihaylov, T.; Simig, D.; Yu, P.; Shuster, K.; Wang, T.; Liu, Q.; Koura, P. S.; et al. 2022. OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization. arXiv:2212.12017.
  • Jaume, Ekenel, and Thiran (2019) Jaume, G.; Ekenel, H. K.; and Thiran, J.-P. 2019. FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. In ICDARW.
  • Kembhavi et al. (2016) Kembhavi, A.; Salvato, M.; Kolve, E.; Seo, M.; Hajishirzi, H.; and Farhadi, A. 2016. A Diagram Is Worth A Dozen Images. In ECCV, 235–251.
  • Kembhavi et al. (2017) Kembhavi, A.; Seo, M. J.; Schwenk, D.; Choi, J.; Farhadi, A.; and Hajishirzi, H. 2017. Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension. In CVPR, 5376–5384.
  • Kim et al. (2022) Kim, G.; Hong, T.; Yim, M.; Nam, J.; Park, J.; Yim, J.; Hwang, W.; Yun, S.; Han, D.; and Park, S. 2022. OCR-free Document Understanding Transformer. In ECCV, 498–517.
  • Krishna et al. (2017) Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.; Shamma, D. A.; Bernstein, M. S.; and Fei-Fei, L. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis., 123(1): 32–73.
  • Kuznetsova et al. (2020) Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. 2020. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 1956–1981.
  • Landeghem et al. (2023) Landeghem, J.; Tito, R.; Borchmann, Ł.; Pietruszka, M.; Józiak, P.; Powalski, R.; Jurkiewicz, D.; Coustaty, M.; Ackaert, B.; Valveny, E.; et al. 2023. Document Understanding Dataset and Evaluation (DUDE). arXiv:2305.08455.
  • Lee et al. (2023) Lee, K.; Joshi, M.; Turc, I. R.; Hu, H.; Liu, F.; Eisenschlos, J. M.; Khandelwal, U.; Shaw, P.; Chang, M.-W.; and Toutanova, K. 2023. Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding. In ICML, 18893–18912.
  • Li et al. (2023) Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023. BLIP-2: Bootstrap** Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In ICML.
  • Li et al. (2020) Li, M.; Xu, Y.; Cui, L.; Huang, S.; Wei, F.; Li, Z.; and Zhou, M. 2020. DocBank: A Benchmark Dataset for Document Layout Analysis. In COLING, 949–960.
  • Lin et al. (2014) Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common Objects in Context. In ECCV, 740–755.
  • Liu et al. (2023a) Liu, H.; Li, C.; Wu, Q.; and Lee, Y. J. 2023a. Visual Instruction Tuning. arXiv:2304.08485.
  • Liu et al. (2023b) Liu, Y.; Li, Z.; Li, H.; Yu, W.; Huang, M.; Peng, D.; Liu, M.; Chen, M.; Li, C.; **, L.; et al. 2023b. On the Hidden Mystery of OCR in Large Multimodal Models. arXiv:2305.07895.
  • Longpre et al. (2023) Longpre, S.; Hou, L.; Vu, T.; Webson, A.; Chung, H. W.; Tay, Y.; Zhou, D.; Le, Q. V.; Zoph, B.; Wei, J.; et al. 2023. The FLAN collection: Designing Data and Methods for Effective Instruction Tuning. arXiv:2301.13688.
  • Loshchilov and Hutter (2017) Loshchilov, I.; and Hutter, F. 2017. Decoupled Weight Decay Regularization. arXiv:1711.05101.
  • Lu et al. (2022) Lu, P.; Mishra, S.; Xia, T.; Qiu, L.; Chang, K.-W.; Zhu, S.-C.; Tafjord, O.; Clark, P.; and Kalyan, A. 2022. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In NeurIPS.
  • Lu et al. (2021) Lu, P.; Qiu, L.; Chen, J.; Xia, T.; Zhao, Y.; Zhang, W.; Yu, Z.; Liang, X.; and Zhu, S.-C. 2021. IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning. In NeurIPS.
  • Masry et al. (2022) Masry, A.; Do, X. L.; Tan, J. Q.; Joty, S.; and Hoque, E. 2022. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. In ACL Findings, 2263–2279.
  • Mathew et al. (2022) Mathew, M.; Bagal, V.; Tito, R.; Karatzas, D.; Valveny, E.; and Jawahar, C. 2022. InfographicVQA. In WACV, 1697–1706.
  • Mathew, Karatzas, and Jawahar (2021) Mathew, M.; Karatzas, D.; and Jawahar, C. V. 2021. DocVQA: A Dataset for VQA on Document Images. In WACV, 2200–2209.
  • Mishra et al. (2019) Mishra, A.; Shekhar, S.; Singh, A. K.; and Chakraborty, A. 2019. OCR-VQA: Visual Question Answering by Reading Text in Images. In ICDAR, 947–952.
  • Mishra et al. (2022) Mishra, S.; Khashabi, D.; Baral, C.; and Hajishirzi, H. 2022. Cross-Task Generalization via Natural Language Crowdsourcing Instructions. In ACL, 3470–3487.
  • OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774.
  • Park et al. (2019) Park, S.; Shin, S.; Lee, B.; Lee, J.; Surh, J.; Seo, M.; and Lee, H. 2019. CORD: A Consolidated Receipt Dataset for Post-OCR Parsing. In Workshop on Document Intelligence at NeurIPS.
  • Pfitzmann et al. (2022) Pfitzmann, B.; Auer, C.; Dolfi, M.; Nassar, A. S.; and Staar, P. 2022. DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation. In KDD, 3743–3751.
  • Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning Transferable Visual Models from Natural Language Supervision. In ICML, 8748–8763.
  • Šimsa et al. (2023) Šimsa, Š.; Šulc, M.; Uřičář, M.; Patel, Y.; Hamdi, A.; Kocián, M.; Skalickỳ, M.; Matas, J.; Doucet, A.; Coustaty, M.; and Karatzas, D. 2023. DocILE Benchmark for Document Information Localization and Extraction. arXiv:2302.05658.
  • Singh et al. (2019) Singh, A.; Natarajan, V.; Shah, M.; Jiang, Y.; Chen, X.; Batra, D.; Parikh, D.; and Rohrbach, M. 2019. Towards VQA Models That Can Read. In CVPR, 8317–8326.
  • Sun et al. (2021) Sun, H.; Kuang, Z.; Yue, X.; Lin, C.; and Zhang, W. 2021. Spatial Dual-Modality Graph Reasoning for Key Information Extraction. arXiv:2103.14470.
  • Tanaka et al. (2023) Tanaka, R.; Nishida, K.; Nishida, K.; Hasegawa, T.; Saito, I.; and Saito, K. 2023. SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images. In AAAI, 13636–13645.
  • Tanaka, Nishida, and Yoshida (2021) Tanaka, R.; Nishida, K.; and Yoshida, S. 2021. VisualMRC: Machine Reading Comprehension on Document Images. In AAAI, 13878–13888.
  • Tüselmann et al. (2022) Tüselmann, O.; Müller, F.; Wolf, F.; and Fink, G. A. 2022. Recognition-free Question Answering on Handwritten Document Collections. In ICFHR, 259–273.
  • Wang et al. (2021) Wang, B.; Li, G.; Zhou, X.; Chen, Z.; Grossman, T.; and Li, Y. 2021. Screen2words: Automatic mobile UI summarization with multimodal learning. In UIST, 498–510.
  • Wei et al. (2021) Wei, J.; Bosma, M.; Zhao, V. Y.; Guu, K.; Yu, A. W.; Lester, B.; Du, N.; Dai, A. M.; and Le, Q. V. 2021. Finetuned language models are zero-shot learners. In ICLR.
  • Xu et al. (2020) Xu, Y.; Li, M.; Cui, L.; Huang, S.; Wei, F.; and Zhou, M. 2020. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In KDD, 1192–1200.
  • Xu et al. (2021) Xu, Y.; Xu, Y.; Lv, T.; Cui, L.; Wei, F.; Wang, G.; Lu, Y.; Florêncio, D. A. F.; Zhang, C.; Che, W.; Zhang, M.; and Zhou, L. 2021. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. In ACL/IJCNLP, 2579–2591.
  • Xu, Shen, and Huang (2023) Xu, Z.; Shen, Y.; and Huang, L. 2023. MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning. In ACL, 11445–11465.
  • Ye et al. (2023a) Ye, J.; Hu, A.; Xu, H.; Ye, Q.; Yan, M.; Dan, Y.; Zhao, C.; Xu, G.; Li, C.; Tian, J.; Qi, Q.; Zhang, J.; and Huang, F. 2023a. mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding. arXiv:2307.02499.
  • Ye et al. (2023b) Ye, Q.; Xu, H.; Xu, G.; Ye, J.; Yan, M.; Zhou, Y.; Wang, J.; Hu, A.; Shi, P.; Shi, Y.; Jiang, C.; Li, C.; Xu, Y.; Chen, H.; Tian, J.; Qi, Q.; Zhang, J.; and Huang, F. 2023b. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. arXiv:2304.14178.
  • Zhang et al. (2023) Zhang, Y.; Zhang, R.; Gu, J.; Zhou, Y.; Lipka, N.; Yang, D.; and Sun, T. 2023. LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding. arXiv:2306.17107.
  • Zhu et al. (2023) Zhu, D.; Chen, J.; Shen, X.; Li, X.; and Elhoseiny, M. 2023. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv:2304.10592.
  • Zhu et al. (2022) Zhu, F.; Lei, W.; Feng, F.; Wang, C.; Zhang, H.; and Chua, T.-S. 2022. Towards Complex Document Understanding by Discrete Reasoning. In ACMM, 4857–4866.

Appendix A Further InstructDoc Details

Dataset Document type Held-out (split) # Held-in instances # Held-out instances
DocILE (Šimsa et al. 2023) Invoice 56,369 -
KLC (Borchmann et al. 2021) Report 13,449 -
Deepform (Borchmann et al. 2021) Form 3,409 -
FUNSD (Jaume, Ekenel, and Thiran 2019) Form ✓(test) - 2,270
PWC (Borchmann et al. 2021) Article 2,853 -
Wildreceipt (Sun et al. 2021) Receipt 16,188 -
CORD (Park et al. 2019) Receipt ✓(test) - 1,336
SROIE (Huang et al. 2019) Receipt 2,503 -
VisualMRC (Tanaka, Nishida, and Yoshida 2021) Web-page 21,015 -
WebSRC (Chen et al. 2021) Web-page 307,315 -
OCRVQA (Mishra et al. 2019) Book cover 798,245 -
DocVQA (Mathew, Karatzas, and Jawahar 2021) Business document 39,463 -
HW-SQuAD (Tüselmann et al. 2022) Hand-written document 67,887 -
TAT-DQA (Zhu et al. 2022) Report 13,251 -
WTQ (Borchmann et al. 2021) Table 14,224 -
IconQA (Lu et al. 2021) Diagram 29,859 -
AI2D (Kembhavi et al. 2016) Diagram 12,413 -
ScienceQA (Lu et al. 2022) Diagram 12,726 -
TextbookQA (Kembhavi et al. 2017) Diagram 6,501 -
InfographicVQA (Mathew et al. 2022) Infographic ✓(dev) - 2,801
ChartQA (Masry et al. 2022) Chart ✓(test) - 2,500
SlideVQA (Tanaka et al. 2023) Slide deck ✓(test) - 2,215
DUDE (Landeghem et al. 2023) Open-domain ✓(dev) - 6,315
TabFact (Borchmann et al. 2021) Table ✓(dev) - 12,740
LLaVAR (Zhang et al. 2023) 8 types documents 19,800 -
RVL-CDIP (Harley, Ufkes, and Derpanis 2015) Business document 320,000 -
SciCap (Hsu, Giles, and Huang 2021) Figure 75,494 -
Screen2Words (Wang et al. 2021) Mobile UI 15,743 -
DocBank (Li et al. 2020) Article 3,999,811 -
DocLaynet (Pfitzmann et al. 2022) Report 69,102 -
Table 7: Datasets used in InstructDoc.

Dataset Collection

Dataset list.

Table 7 shows the detail of all datasets we used in InstructDoc. It contains 5,917,602 held-in instances and 30,177 held-out instances.

Query rephrasing.

Table 8 shows the detail of the query rephrasing annotation. The rephrased queries are more easily understandable phrases than the original queries.

Instruction annotation.

Table 9-18 show the examples of instructions for each task in InstructDoc.

Dataset Analysis

Starting words of the instructions.

Refer to caption
Figure 7: Distribution of first three words of the instructions.

Figure 7 shows the sunburst pattern of the first three words of the instructions. It can be seen that the instructions contain various types, such as questions (e.g., “What is the”) and requests (e.g., “I want to”) used in real-world situations.

Answer styles.

Figure 8 shows InstructDoc has five diverse answer types.

Word clouds.

Figure 9 shows how diverse the vocabulary space is in InstructDoc.

Refer to caption
Figure 8: Distribution of the answer styles in QA datasets.
Refer to caption
(a) Word cloud of instructions.
Refer to caption
(b) Word cloud of answers.
Refer to caption
(c) Word cloud of document images.
Figure 9: Statistics of tokens in the instruction, answer, document images.

Appendix B Further Evaluation Setup Details

Main Evaluation Datasets Details

FUNSD.

Form Understanding in Noisy Scanned Documents (FUNSD) (Jaume, Ekenel, and Thiran 2019) evaluates on the KIE task: predicting the entity, “title”, “key”, “value”, or “other”, for the assigned text token.

CORD.

Consolidated Receipt Dataset for Post-OCR Parsing (CORD) (Park et al. 2019) is the KIE dataset with 30 labels under 4 categories such as “total” or “subtotal”.

InfographicVQA.

This dataset focuses on the task of single-page QA w/ discrete & visual reasoning on infographics. It requires understanding plots/graphs, texts, and layout (Mathew et al. 2022).

ChartQA.

This dataset focuses on the task of single-page QA w/ discrete & visual reasoning on chart images. We used both two subsets: (i) machine-generated set and (ii) human-written set (Masry et al. 2022).

TabFact.

This dataset studies the task of Document NLI with semi-structured evidence over tables. It predicts the entailment relationship between two sentences in a document (Borchmann et al. 2021).

DUDE.

Document Understanding Dataset and Evaluation (DUDE) (Landeghem et al. 2023) focuses on the task of multi-page QA w/ discrete & visual & multi-hop reasoning. It is a multi-page, multi-domain, and multi-industry Document VQA for real-world document understanding.

SlideVQA.

This dataset focuses on the task of multi-page QA w/ discrete & visual & multi-hop reasoning on the slide deck composed of multiple images. It requires selecting a set of evidence and answering the question (Tanaka et al. 2023).

Other Evaluation Datasets Details

VisualMRC.

Visual Machine Reading Comprehension (VisualMRC) (Tanaka, Nishida, and Yoshida 2021) is the task of abstractive single-page QA on the Web screenshot. We used the end-to-end setting where answers are derived from OCR results and images without ROI detection.

TextVQA.

It contains scene images from Open Images dataset (Kuznetsova et al. 2020), with questions asking to reason about text in the image (Singh et al. 2019).

ST-VQA.

It contains scene images from multiple sources, such as Visual Genome (Krishna et al. 2017). We used the Open Dictionary setting where answer candidates and vocabularies are not provided at test time (Biten et al. 2019).

Dataset Before \longrightarrow After
header \longrightarrow title
FUNSD question \longrightarrow key
answer \longrightarrow value
menu.nm \longrightarrow menu_name
menu.num \longrightarrow menu_id
menu.unitprice \longrightarrow menu_unitprice
menu.cnt \longrightarrow menu_quantity
menu.discountprice \longrightarrow menu_discountprice
menu.price \longrightarrow menu_price
menu.itemsubtotal \longrightarrow menu_price_discount_applied
menu.vatyn \longrightarrow menu_whether_price_tax_includes
menu.etc \longrightarrow menu_etc
menu.sub_nm \longrightarrow submenu_name
menu.sub_unitprice \longrightarrow submenu_unitprice
menu.sub_cnt \longrightarrow submenu_quantity
menu.sub_price \longrightarrow submenu_price
menu.sub_etc \longrightarrow submenu_etc
CORD void_menu.nm \longrightarrow voidmenu_name
void_menu.price \longrightarrow voidmenu_price
sub_total.subtotal_price \longrightarrow subtotal_price
sub_total.discount_price \longrightarrow subtotal_discount_price
sub_total.service_price \longrightarrow subtotal_service_price
sub_total.othersvc_price \longrightarrow subtotal_chargeprice
sub_total.tax_price \longrightarrow subtotal_tax_price
sub_total.etc \longrightarrow subtotal_etc
total.total_price \longrightarrow total_price
total.total_etc \longrightarrow total_etc
total.cashprice \longrightarrow total_cashprice
total.changeprice \longrightarrow total_changeprice
total.creditcardprice \longrightarrow total_creditcardprice
total.emoneyprice \longrightarrow total_emoneyprice
total.menutype_cnt \longrightarrow total_menutype_count
total.menuqty_cnt \longrightarrow total_menuquantity_count
Table 8: Query rephrasing in KIE datasets. Before/After denotes the original/rephrased queries.

Input

[Uncaptioned image] There are 30 categories for selection:

“menu_name“, “menu_id“, ”menu_unitprice”, “menu_quantity”,

“menu_discountprice”, “menu_price”,“menu_price_discount_applied”,

“menu_whether_price_tax_includes”, ”menu_etc”, “submenu_name”,

“submenu_unitprice”, ”submenu_quantity”, “submenu_price”,“submenu_etc”,

“voidmenu_name”, ”voidmenu_price”, “subtotal_price”, “subtotal_discount_price”,

“subtotal_service_price”, “subtotal_chargeprice”, “subtotal_tax_price”,

“subtotal_etc”, “total_price”, “total_etc”, ”total_cashprice”,

“total_changeprice”, “total_creditcardprice”, “total_emoneyprice”,

”total_menutype_count”, and “total_menuquantity_count”.

Please output the category corresponding to the text “16,500”.

Target

menu_price

Table 9: Example input and target for CORD dataset.

Input

[Uncaptioned image] Respond to the question “What are the LS and LD commands used for?” with an abstractive answer based on the content of the document.

You should answer with a natural language text with the correct information extracted strictly from the document.

Target

It is used to list the directory in long form and see all of its information.

Table 10: Example input and target for VisualMRC dataset.

Input

[Uncaptioned image] You must answer to the question “what is the difference in attendance between the first two entries?” after reading and comprehending the document.

Answers can be either a specific span of the document or a numerical value (such as “2,” “2.5,” “2%,” etc.).

Numerical questions require counting or summing values.

Target

7824

Table 11: Example input and target for WTQ dataset.

Input

[Uncaptioned image] Answer shortly the following question “Which is the most narrow in this diagram?” based on the document.

Please select one answer from the following options:

“Elliptical”,

“Linear”,

“Deltoid”,

“Oval”

Target

Linear

Table 12: Example input and target for AI2D dataset.

Input

[Uncaptioned image] There is some text on the image.

Provide your short answer to the question “Which three business types is Pinterest good for?” based on the document.

The answer type can be divided following types:

- Directly extracting the words from the image

- a list of spans inside of document (each span should be separated by “,”)

- not exist explicitly as a span of document

Target

Interior design, Restaurants, Wedding venues

Table 13: Example input and target for InfographicVQA dataset.

Input

[Uncaptioned image] Respond to the question “What is the combined percentage of Album Sales % and Song Sales % for the genre with a 9% Share of Total Activity?” with a short answer based on the content of the multi-page document.

If you need mathematical reasoning, you first extract numbers and then calculate them to answer the question.

Answers contain either:

- a span inside of document

- a list of spans inside of document (each span should be separated by “,”)

- not exist explicitly as a span of document (the answer should be freely generated text)

Directly answer the question from the document with 1 to 3 words as possible.

Target

23%

Table 14: Example input and target for SlideVQA dataset.

Input

[Uncaptioned image] I want to know a short answer to the question.

Can we conclude that “during the third round of the turkish cup , there be no new entry during that stage”, based on the content of the given document?

Answers are selected from given options “yes” or “no”, and you should not output other than the selected option.

Target

yes

Table 15: Example input and target for TabFact dataset.

Input

[Uncaptioned image] Pick up a suitable class from following classes:

“letter”, “form”, “email”, “handwritten”, “advertisement”, “scientific report”, “scientific publication”, “specification”, “file folder”, “news article”, “budget”, “invoice”, ”presentation”, “questionnaire”, “resume”, and “memo”.

Target

handwritten

Table 16: Example input and target for RVL-CDIP dataset.

Input

[Uncaptioned image] Share your thoughts by using a few words on the content of the image.

Target

Page with search bar to search hotel.

Table 17: Example input and target for Screen2Words dataset.

Input

[Uncaptioned image] Please detect all regions that meet following classes:

“Abstract”, “Author”, “Caption”, “Date”, “Equation”, “Figure”, ”Footer”, ”List”, “Paragraph”, “Reference”, “Section”, “Table”, and “Title”.

Target

Picture [750, 29, 954, 94] Section-header [209, 120, 376, 139] Table [151, 191, 850, 702] Picture [521, 731, 837, 946] Picture [158, 733, 485, 955] Page-footer [897, 962, 911, 972]

Table 18: Example input and target for DocLaynet dataset.